The data update path doesn't need a printbuf for its log message - this
will help reduce stack usage.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There can be a lot of rendundancy in accounting updates within a single
btree transaction.
Split out accounting updates so that they can be deduped, in the next
commit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Allow btree_insert_entry.ip_allocated to be passed in, so we get better
info on where alloc updates are coming from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're occasionally seeing the WARN_ON() for bump allocator usage
exceeding BTREE_TRANS_MEM_MAX; add some tracing so we can see what's
going on.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a journal entry type for logging - but logging a bkey, not a string;
to be used for data move path debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We just had a report of the assert for "btree in write buffer for
non-write buffer btree" popping during the 6.14 upgrade.
- 150TB filesystem, after a reboot the upgrade was able to continue from
where it left off, so no major damage.
But with 6.14 about to come out we want to get this tracked down asap,
and need more data if other users hit this.
Convert the BUG_ON() to an emergency read-only, and print out btree, the
key itself, and stack trace from the original write buffer update (which
did not have this check before).
Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We unconditionally go read-write, if we're going to do so, before
journal replay: lazy_rw is obsolete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Using commit_do() to call alloc_sectors_start_trans() breaks when we're
randomly injecting transaction restarts - the restart in the commit
causes us to leak the lock that alloc_sectorS_start_trans() takes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a kasan splat in propagate_key_to_snapshot_leaves() -
varint_decode_fast() does reads (that it never uses) up to 7 bytes past
the end of the integer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The commit 65bd442397 ("bcachefs: bch2_btree_insert_trans() no longer
specifies BTREE_ITER_cached") removes BTREE_ITER_cached from
bch2_btree_insert_trans, which causes the update_inode function from
bcachefs-tools to take a long time (~20s). Add an iter_flags parameter
to bch2_btree_insert, so the users can specify iter update trigger
flags, such as BTREE_ITER_cached.
Signed-off-by: Ariel Miculas <ariel.miculas@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Main part of the disk accounting rewrite.
This is a wholesale rewrite of the existing disk space accounting, which
relies on percepu counters that are sharded by journal buffer, and
rolled up and added to each journal write.
With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.
Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
they're not strictly required.
An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Until accounting keys hit the btree, they are deltas, not new versions
of the existing key; this means we have to teach journal replay to
accumulate them.
Additionally, the journal doesn't track precisely which entries have
been flushed to the btree; it only tracks a range of entries that may
possibly still need to be flushed.
That means we need to compare accounting keys against the version in the
btree and only flush updates that are newer.
There's another wrinkle with the write buffer: if the write buffer
starts flushing accounting keys before journal replay has finished
flushing accounting keys, journal replay will see the version number
from the new updates and updates from the journal will be lost.
To avoid this, journal replay has to flush accounting keys first, and
we'll be adding a flag so that write buffer flush knows to hold
accounting keys until then.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New on disk format version for bch_alloc->stripe_sectors and
BCH_DATA_unstriped - accounting for unstriped data in stripe buckets.
Upgrade/downgrade requires regenerating alloc info - but only if erasure
coding is in use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Combine iter/update/trigger/str_hash flags into a single enum, and
x-macroize them for a to_text() function later.
These flags are all for a specific iter/key/update context, so it makes
sense to group them together - iter/update/trigger flags were already
given distinct bits, this cleans up and unifies that handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Provide a non-write buffer version of bch2_btree_bit_mod_buffered(), for
the subvolume children btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of using a darray, we now allocate journal entries for the
transaction commit path with our normal bump allocator - with an inlined
fastpath, and using btree_transaction_stats to remember how much to
initially allocate so as to avoid transaction restarts.
This is prep work for converting write buffer updates to use this
mechanism.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Create a separate enum for str_hash flags - instead of abusing the
btree_insert_flags enum - and create a __bitwise typedef for sparse
typechecking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
./fs/bcachefs/btree_update.h: journal.h is included more than once.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6573
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.
But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new btree for long running logged operations - i.e. for logging
operations that we can't do within a single btree transaction, so that
they can be resumed if we crash.
Keys in the logged operations btree will represent operations in
progress, with the state of the operation stored in the value.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce support for prejournaled key updates. This allows a
transaction to commit an update for a key that already exists (and
is pinned) in the journal. This is required for btree write buffer
updates as the current scheme of journaling both on write buffer
insertion and write buffer (slow path) flush is unsafe in certain
crash recovery scenarios.
Create a small trans update wrapper to pass along the seq where the
key resides into the btree_insert_entry. From there, trans commit
passes the seq into the btree insert path where it is used to manage
the journal pin for the associated btree leaf.
Note that this patch only introduces the underlying mechanism and
otherwise includes no functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bkey_make_mut() now takes the bkey_s_c by reference and points it
at the new, mutable key.
This helps in some fsck paths that may have multiple repair operations
on the same key.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When partially overwriting an extent in an older snapshot, the existing
extent has to be split.
If the existing extent was overwritten in a different (sibling)
snapshot, we have to ensure that the split won't be visible in the
sibling snapshot.
data_update.c already has code for this,
bch2_insert_snapshot_writeouts() - we just need to move it into
btree_update_leaf.c and change bch2_trans_update_extent() to use it as
well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>