This fixes exceeding the bump allocator limit when the allocator finds
many buckets that need repair - they're repaired asynchronously, which
means that every error logged a message in the bump allocator, without
committing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to start searching from search_key - _not_ path->pos, which will
point to the key we found in the btree
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now the alloc_req is allocated from the bump allocator, if there is
reallocation, the memory of alloc_req would be frees, fix by delaying the
reallocation to transaction restart, it has to restart anyway.
Reported-by: syzbot+2887a13a5c387e616a68@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Allocating new memory when mempool is exhausted is too complicated, just
return ENOMEM is fine. memcpy is not needed, since there might be
pointers point to the old memory, that's the bug.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've been seeing some livelock-ish behavior in the index update part of
the main write path, and while we've got low level btree path
tracepoints, we've been lacking high level btree iterator tracepoints.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The new guard(), scoped_guard() allow for more natural code.
Some of the uses with creative flow control have been left.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Allocate some (smaller) temporary storage in btree_trans for this -
btree_path_down() is in our max-stack call stack.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If path->should_be_locked is true, that means user code (of the btree
API) has seen, in this transaction, something guarded by the node this
path has locked, and we have to keep it locked until the end of the
transaction.
Assert that we're not violating this; should_be_locked should also be
cleared only in _very_ special situations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Avoid transaction restarts due to failure to upgrade - we can traverse a
new iterator without a transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_path_get_locks, on failure, shouldn't unlock if we're not issuing
a transaction restart: we might drop locks we're not supposed to (if
path->should_be_locked is set).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_path_put_nokeep() was intended for paths we wouldn't need to
preserve for a transaction restart - it always frees them right away
when the ref hits 0.
But since paths are shared, freeing unconditionally is a bug, the path
might have been used elsewhere and have should_be_locked set, i.e. we
need to keep it locked until the end of the transaction.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More error message cleanup: instead of multiple printk()s per error, we
want to be building up a single error message in a printbuf, so that it
can be printed with indenting that shows grouping and avoid errors
getting interspersed or lost in the log.
This gets rid of most calls to bch2_fs_emergency_read_only(). We still
have calls to
- bch2_fatal_error()
- bch2_fs_fatal_error()
- bch2_fs_fatal_err_on()
that need work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We'd like users to be able to debug without building custom kernels, so
this will help us get rid of CONFIG_BCACHEFS_DEBUG, at least for most
things.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_print_string_as_lines() is a low level helper that allows messages
longer than 1k to be printed without truncation.
But we should always be printing with the helpers that take a filesystem
object, if we're in fsck they direct output to the userspace process
controlling fsck instead of the dmesg log.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're occasionally seeing the WARN_ON() for bump allocator usage
exceeding BTREE_TRANS_MEM_MAX; add some tracing so we can see what's
going on.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_key_cache_fill() will allocate and traverse another path (for the
underlying btree), so we can't hold pointers to paths across a call - we
have to pass indices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
set_should_be_locked() needs to be called before peek_key_cache(), which
traverses other paths and may do a trans unlock/relock.
This fixes an assertion pop in path_peek_slot(), when the path we're
using is unexpectedly not uptodate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes btree locking assert pops users were seeing during evacuate:
https://github.com/koverstreet/bcachefs/issues/878
May 09 22:45:02 sharon kernel: bcachefs (68116e25-fa2d-4c6f-86c7-e8b431d792ae): bch2_btree_insert_node(): node not locked at level 1
May 09 22:45:02 sharon kernel: bch2_btree_node_rewrite [bcachefs]: watermark=btree no_check_rw alloc l=0-1 mode=none nodes_written=0 cl.remaining=2 journal_seq=0
May 09 22:45:02 sharon kernel: path: idx 1 ref 1:0 S B btree=alloc level=0 pos 0:3699637:0 0:3698012:1-0:3699637:0 bch2_move_btree.isra.0+0x1db/0x490 [bcachefs] uptodate 0 locks_want 2
May 09 22:45:02 sharon kernel: l=0 locks intent seq 4 node ffff8bd700c93600
May 09 22:45:02 sharon kernel: l=1 locks unlocked seq 1712 node ffff8bd6fd5e7a00
May 09 22:45:02 sharon kernel: l=2 locks unlocked seq 2295 node ffff8bd6cc725400
May 09 22:45:02 sharon kernel: l=3 locks unlocked seq 0 node 0000000000000000
Evacuate walks btree nodes with bch2_btree_iter_next_node() and rewrites
them, bch2_btree_update_start() upgrades the path to take intent locks
as far as it needs to.
But next_node() does low level unlock/relock calls on individual nodes,
and didn't handle the case where a path is supposed to be holding
multiple intent locks. If a path has locks_want > 1, it needs to be
either holding locks on all the btree nodes (at each level) requested,
or none of them.
Fix this with a bch2_btree_path_downgrade().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
At the end of the inode, on an extents iterator, peek_slot() has to
advance to the next position to avoid returning a 0 size extent, which
is not allowed.
Changing iter->pos confuses peek_prev(), but we don't need to call
peek_slot() in this case.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The issue this assert is guarding against is that in
BTREE_ITER_filter_snapshots mode we only want to be iterating within a
single inode number - if we iterate into another inode number with keys
for a different snapshot tree, we'll loop arbitrarily long before
finding a key we can return.
This comes up in the unit tests, where we're using inode 0 for our test
keys.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was planned to be done ages ago, now finally completed; there are
places where we have quite a few btree_trans objects on the stack, so
this reduces stack usage somewhat.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Build up and emit the error message for an inconsistency error all at
once, instead of spread over multiple printk calls, so they're not
jumbled in the dmesg log.
Also, add better indenting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Incorrectly handled transaction restarts can be a source of heisenbugs;
add a mode where we randomly inject them to shake them out.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BTREE_ITER_cached_nofill has some tricky corner cases; it's used
internally for iterators that aren't walking the key cache, but need to
be coherent with the key cache.
It tells traverse to look up and lock the key cache entry if present,
but don't create one if it doesn't exist.
That means we have to have a BTREE_ITER_UPTODATE path (because after
traverse the path has to be UPTODATE, or we pop assertions) that doesn't
point to anything (which is the less bad option, taken by the previous
fix).
The previous fix for this path missed an issue that can happen in
bch2_trans_peek_key_cache(): we can't set should_be_locked on a path
that doesn't point to anything and doesn't hold locks.
Fixes: bd5b09727f ("bcachefs: Don't set btree_path to updtodate if we don't fill")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_iter_flags() now takes a level parameter; this fixes a bug
where using a node iterator on a leaf wouldn't set
BTREE_ITER_with_key_cache, leading to fun cache coherency bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Normally, whitouts (KEY_TYPE_whitout) are filtered from btree lookups,
since they exist only to represent deletions of keys in ancestor
snapshots - except, they should not be filtered in
BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean
them up.
This means that that the key cache has to store whiteouts, and key cache
fills cannot filter them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In BTREE_ITER_all_snapshots mode, we're required to only return keys
where the snapshot field matches the iterator position -
BTREE_ITER_filter_snapshots requires pulling keys into the key cache
from ancestor snapshots, so we have to check for that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Historically, we required that all btree node roots point to a valid
(possibly fake) node, but we're improving our ability to continue in the
presence of errors.
Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To help ameloriate issues with peek operations having to skip over
deletions in the journal - just bail out if all we're doing is
prefetching btree nodes.
Since btree node prefetching runs every time we iterate to a new node,
and has to sequentially scan ahead, this avoids another O(n^2).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With extents and snapshots, for slightly different reasons, we may have
to search forwards to find a key that compares equal to iter->pos (i.e.
a key that peek_prev() should return, as it returns keys <= iter->pos).
peek_slot() does this, and is an easy way to fix this case.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A user contributed a filessytem dump, where the dump was actually
corrupted (due to being taken while the filesystem was online), but
which exposed an interesting bug in fsck - reconstruct_inode().
When itearting in BTREE_ITER_filter_snapshots mode, it's required to
give an end position for the iteration and it can't span inode numbers;
continuing into the next inode might mean we start seeing keys from a
different snapshot tree, that the is_ancestor() checks always filter,
thus we're never able to return a key and stop iterating.
Backwards iteration never implemented the end position because nothing
else needed it - except for reconstuct_inode().
Additionally, backwards iteration is now able to overlay keys from the
journal, which will be useful if we ever decide to start doing journal
replay in the background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Collapse all the BTREE_ITER_filter_snapshots handling down into a single
block; btree iteration is much simpler in the !filter_snapshots case.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>