bcachefs fixes for 6.14-rc2

Merge tag 'bcachefs-2025-02-06.2' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Nothing major, things continue to be fairly quiet over here.

   - add a SubmittingPatches to clarify that patches submitted for
     bcachefs do, in fact, need to be tested

   - discard path now correctly issues journal flushes when needed; this
     fixes performance issues when the filesystem is nearly full and
     we're bottlenecked on copygc

   - fix a bug that could cause the pending rebalance work accounting to
     be off when devices are being onlined/offlined; users should report
     if they are still seeing this

   - and a few more trivial ones"

* tag 'bcachefs-2025-02-06.2' of git://evilpiepirate.org/bcachefs:
  bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on bch_extent_rebalance
  bcachefs: Fix rcu imbalance in bch2_fs_btree_key_cache_exit()
  bcachefs: Fix discard path journal flushing
  bcachefs: fix deadlock in journal_entry_open()
  bcachefs: fix incorrect pointer check in __bch2_subvolume_delete()
  bcachefs docs: SubmittingPatches.rst
commit 94b481f767 (Linus Torvalds, 2025-02-07 09:16:07 -08:00)
20 changed files with 215 additions and 59 deletions

Documentation/filesystems/bcachefs/SubmittingPatches.rst
@@ -0,0 +1,98 @@
Submitting patches to bcachefs:
===============================

Patches must be tested before being submitted, either with the xfstests suite
[0] or the full bcachefs test suite in ktest [1], depending on what's being
touched. Note that ktest wraps xfstests and will be the easier way to run it
for most users; it includes single-command wrappers for all the mainstream
in-kernel local filesystems.
Patches will undergo more testing after being merged (including
lockdep/kasan/preempt/etc. variants); the submitter is not generally required
to run these. But do put some thought into what you're changing and which
tests might be relevant: doing tricky memory layout work? Run kasan. Doing
locking work? Run lockdep. ktest includes single-command variants for the
debug build types you'll most likely need.

The exception to this rule is incomplete WIP/RFC patches: if you're working on
something nontrivial, you're encouraged to send out a WIP patch to let people
know what you're doing and make sure you're on the right track. Just make sure
it includes a brief note as to what's done and what's incomplete, to avoid
confusion.

Rigorous checkpatch.pl adherence is not required (many of its warnings are
considered out of date), but try not to deviate too much without reason. Focus
on writing code that reads well and is organized well; code should be
aesthetically pleasing.
CI:
===

When running the full test suite, rather than running your tests locally it's
preferable to let a server farm do it in parallel, and then have the results
in a nice test dashboard (which can tell you which failures are new, and
presents results in a git log view, avoiding the need for most bisecting).

That exists [2], and community members may request an account. If you work for
a big tech company, you'll need to help out with server costs to get access -
but the CI is not restricted to running bcachefs tests: it runs any ktest test
(which generally makes it easy to wrap other tests that can run in qemu).
Other things to think about:
============================

- How will we debug this code? Is there sufficient introspection to diagnose
  when something starts acting wonky on a user machine?

  We don't necessarily need every single field of every data structure visible
  with introspection, but having the important fields of all the core data
  types wired up makes debugging drastically easier - a bit of thoughtful
  foresight greatly reduces the need to have people build custom kernels with
  debug patches.

  More broadly, think about all the debug tooling that might be needed.

- Does it make the codebase more or less of a mess? Can we do some organizing
  while we're at it?

- Do new tests need to be written? New assertions? How do we know and verify
  that the code is correct, and what happens if something goes wrong?

  We don't yet have automated code coverage analysis or easy fault injection -
  but for now, pretend we did and ask what they might tell us.

  Assertions are hugely important, given that we don't yet have a systems
  language that can do ergonomic embedded correctness proofs. Hitting an
  assert in testing is much better than wandering off into undefined behaviour
  la-la land - use them. Use them judiciously, not as a replacement for proper
  error handling, but use them (see the sketch after this list).

- Does it need to be performance tested? Should we add new performance
  counters?

  bcachefs has a set of persistent runtime counters which can be viewed with
  the 'bcachefs fs top' command; this should give users a basic idea of what
  their filesystem is currently doing. If you're doing a new feature or
  looking at old code, think about whether anything should be added.

- If it's a new on disk format feature - have upgrades and downgrades been
  tested? (Automated tests exist but aren't in the CI, due to the hassle of
  disk image management; coordinate to have them run.)
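
A minimal user-space sketch of that assertion discipline - the bucket-size
limit and function names below are invented for illustration, they are not
bcachefs code::

  #include <assert.h>
  #include <errno.h>
  #include <stdio.h>

  #define MAX_BUCKET_SECTORS (1U << 16)   /* invented limit, for illustration */

  /*
   * Untrusted input (anything read from disk or userspace) gets real error
   * handling - corruption should produce an error, never a crash:
   */
  static int validate_bucket_size(unsigned sectors)
  {
          if (!sectors || sectors > MAX_BUCKET_SECTORS)
                  return -EINVAL;
          return 0;
  }

  /*
   * Internal invariants, by contrast, are asserted: if one fires, the bug is
   * in our own code, and failing loudly in testing beats wandering off into
   * undefined behaviour:
   */
  static unsigned sectors_free(unsigned bucket_size, unsigned used)
  {
          assert(used <= bucket_size);    /* invariant: callers never overfill */
          return bucket_size - used;
  }

  int main(void)
  {
          printf("validate: %d\n", validate_bucket_size(0));      /* -22 (-EINVAL) */
          printf("free: %u\n", sectors_free(128, 32));            /* 96 */
          return 0;
  }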

Mailing list, IRC:
==================

Patches should hit the list [3], but much discussion and code review happens
on IRC as well [4]; many people appreciate the more conversational approach
and quicker feedback.

Additionally, we have a lively user community doing excellent QA work, which
exists primarily on IRC. Please make use of that resource; user feedback is
important for any nontrivial feature, and documenting it in commit messages
would be a good idea.

[0]: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
[1]: https://evilpiepirate.org/git/ktest.git/
[2]: https://evilpiepirate.org/~testdashboard/ci/
[3]: linux-bcachefs@vger.kernel.org
[4]: irc.oftc.net#bcache, #bcachefs-dev

Documentation/filesystems/bcachefs/index.rst
@@ -9,4 +9,5 @@ bcachefs Documentation
:numbered:
CodingStyle
SubmittingPatches
errorcodes

MAINTAINERS
@@ -3954,6 +3954,7 @@ M: Kent Overstreet <kent.overstreet@linux.dev>
L: linux-bcachefs@vger.kernel.org
S: Supported
C: irc://irc.oftc.net/bcache
P: Documentation/filesystems/bcachefs/SubmittingPatches.rst
T: git https://evilpiepirate.org/git/bcachefs.git
F: fs/bcachefs/
F: Documentation/filesystems/bcachefs/

fs/bcachefs/alloc_background.c
@@ -1803,7 +1803,6 @@ struct discard_buckets_state {
u64 open;
u64 need_journal_commit;
u64 discarded;
u64 need_journal_commit_this_dev;
};
static int bch2_discard_one_bucket(struct btree_trans *trans,
@@ -1827,11 +1826,11 @@ static int bch2_discard_one_bucket(struct btree_trans *trans,
goto out;
}
if (bch2_bucket_needs_journal_commit(&c->buckets_waiting_for_journal,
c->journal.flushed_seq_ondisk,
pos.inode, pos.offset)) {
s->need_journal_commit++;
s->need_journal_commit_this_dev++;
u64 seq_ready = bch2_bucket_journal_seq_ready(&c->buckets_waiting_for_journal,
pos.inode, pos.offset);
if (seq_ready > c->journal.flushed_seq_ondisk) {
if (seq_ready > c->journal.flushing_seq)
s->need_journal_commit++;
goto out;
}
@@ -1865,23 +1864,24 @@ static int bch2_discard_one_bucket(struct btree_trans *trans,
discard_locked = true;
}
if (!bkey_eq(*discard_pos_done, iter.pos) &&
ca->mi.discard && !c->opts.nochanges) {
/*
* This works without any other locks because this is the only
* thread that removes items from the need_discard tree
*/
bch2_trans_unlock_long(trans);
blkdev_issue_discard(ca->disk_sb.bdev,
k.k->p.offset * ca->mi.bucket_size,
ca->mi.bucket_size,
GFP_KERNEL);
*discard_pos_done = iter.pos;
if (!bkey_eq(*discard_pos_done, iter.pos)) {
s->discarded++;
*discard_pos_done = iter.pos;
ret = bch2_trans_relock_notrace(trans);
if (ret)
goto out;
if (ca->mi.discard && !c->opts.nochanges) {
/*
* This works without any other locks because this is the only
* thread that removes items from the need_discard tree
*/
bch2_trans_unlock_long(trans);
blkdev_issue_discard(ca->disk_sb.bdev,
k.k->p.offset * ca->mi.bucket_size,
ca->mi.bucket_size,
GFP_KERNEL);
ret = bch2_trans_relock_notrace(trans);
if (ret)
goto out;
}
}
SET_BCH_ALLOC_V4_NEED_DISCARD(&a->v, false);
@@ -1929,6 +1929,9 @@ static void bch2_do_discards_work(struct work_struct *work)
POS(ca->dev_idx, U64_MAX), 0, k,
bch2_discard_one_bucket(trans, ca, &iter, &discard_pos_done, &s, false)));
if (s.need_journal_commit > dev_buckets_available(ca, BCH_WATERMARK_normal))
bch2_journal_flush_async(&c->journal, NULL);
trace_discard_buckets(c, s.seen, s.open, s.need_journal_commit, s.discarded,
bch2_err_str(ret));
@@ -2024,7 +2027,7 @@ static void bch2_do_discards_fast_work(struct work_struct *work)
break;
}
trace_discard_buckets(c, s.seen, s.open, s.need_journal_commit, s.discarded, bch2_err_str(ret));
trace_discard_buckets_fast(c, s.seen, s.open, s.need_journal_commit, s.discarded, bch2_err_str(ret));
bch2_trans_put(trans);
percpu_ref_put(&ca->io_ref);
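
A user-space sketch of the gating these hunks introduce - struct
journal_state and the names here are simplified stand-ins, not the kernel
types. A bucket whose journal seq isn't on disk yet can't be discarded, and it
only counts toward need_journal_commit (and hence toward triggering
bch2_journal_flush_async()) when no flush covering that seq is already in
flight:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Simplified stand-in for the journal fields used by the discard path: */
  struct journal_state {
          uint64_t flushed_seq_ondisk;    /* newest seq fully on disk */
          uint64_t flushing_seq;          /* newest seq with a flush in flight */
  };

  static bool can_discard(const struct journal_state *j, uint64_t seq_ready,
                          uint64_t *need_journal_commit)
  {
          if (seq_ready > j->flushed_seq_ondisk) {
                  /* not flushed yet: only request a flush if none covers it */
                  if (seq_ready > j->flushing_seq)
                          (*need_journal_commit)++;
                  return false;
          }
          return true;
  }

  int main(void)
  {
          struct journal_state j = { .flushed_seq_ondisk = 10, .flushing_seq = 12 };
          uint64_t need = 0;

          printf("%d\n", can_discard(&j, 9, &need));   /* 1: already on disk */
          printf("%d\n", can_discard(&j, 11, &need));  /* 0: flush in flight, not counted */
          printf("%d\n", can_discard(&j, 13, &need));  /* 0: counted, needs a new flush */
          printf("need_journal_commit = %llu\n", (unsigned long long) need);
          return 0;
  }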

fs/bcachefs/alloc_foreground.c
@@ -205,8 +205,12 @@ static inline bool may_alloc_bucket(struct bch_fs *c,
return false;
}
if (bch2_bucket_needs_journal_commit(&c->buckets_waiting_for_journal,
c->journal.flushed_seq_ondisk, bucket.inode, bucket.offset)) {
u64 journal_seq_ready =
bch2_bucket_journal_seq_ready(&c->buckets_waiting_for_journal,
bucket.inode, bucket.offset);
if (journal_seq_ready > c->journal.flushed_seq_ondisk) {
if (journal_seq_ready > c->journal.flushing_seq)
s->need_journal_commit++;
s->skipped_need_journal_commit++;
return false;
}
@@ -570,7 +574,7 @@ alloc:
? bch2_bucket_alloc_freelist(trans, ca, watermark, &s, cl)
: bch2_bucket_alloc_early(trans, ca, watermark, &s, cl);
if (s.skipped_need_journal_commit * 2 > avail)
if (s.need_journal_commit * 2 > avail)
bch2_journal_flush_async(&c->journal, NULL);
if (!ob && s.btree_bitmap != BTREE_BITMAP_ANY) {

fs/bcachefs/alloc_types.h
@@ -18,6 +18,7 @@ struct bucket_alloc_state {
u64 buckets_seen;
u64 skipped_open;
u64 skipped_need_journal_commit;
u64 need_journal_commit;
u64 skipped_nocow;
u64 skipped_nouse;
u64 skipped_mi_btree_bitmap;

fs/bcachefs/btree_key_cache.c
@@ -748,7 +748,6 @@ void bch2_fs_btree_key_cache_exit(struct btree_key_cache *bc)
rcu_read_unlock();
mutex_lock(&bc->table.mutex);
mutex_unlock(&bc->table.mutex);
rcu_read_lock();
continue;
}
for (i = 0; i < tbl->size; i++)

fs/bcachefs/buckets_waiting_for_journal.c
@@ -22,23 +22,21 @@ static void bucket_table_init(struct buckets_waiting_for_journal_table *t, size_
memset(t->d, 0, sizeof(t->d[0]) << t->bits);
}
bool bch2_bucket_needs_journal_commit(struct buckets_waiting_for_journal *b,
u64 flushed_seq,
unsigned dev, u64 bucket)
u64 bch2_bucket_journal_seq_ready(struct buckets_waiting_for_journal *b,
unsigned dev, u64 bucket)
{
struct buckets_waiting_for_journal_table *t;
u64 dev_bucket = (u64) dev << 56 | bucket;
bool ret = false;
unsigned i;
u64 ret = 0;
mutex_lock(&b->lock);
t = b->t;
for (i = 0; i < ARRAY_SIZE(t->hash_seeds); i++) {
for (unsigned i = 0; i < ARRAY_SIZE(t->hash_seeds); i++) {
struct bucket_hashed *h = bucket_hash(t, i, dev_bucket);
if (h->dev_bucket == dev_bucket) {
ret = h->journal_seq > flushed_seq;
ret = h->journal_seq;
break;
}
}

fs/bcachefs/buckets_waiting_for_journal.h
@@ -4,8 +4,8 @@
#include "buckets_waiting_for_journal_types.h"
bool bch2_bucket_needs_journal_commit(struct buckets_waiting_for_journal *,
u64, unsigned, u64);
u64 bch2_bucket_journal_seq_ready(struct buckets_waiting_for_journal *,
unsigned, u64);
int bch2_set_bucket_needs_journal_commit(struct buckets_waiting_for_journal *,
u64, unsigned, u64, u64);

fs/bcachefs/inode.h
@@ -285,12 +285,14 @@ void bch2_inode_opts_get(struct bch_io_opts *, struct bch_fs *,
struct bch_inode_unpacked *);
int bch2_inum_opts_get(struct btree_trans*, subvol_inum, struct bch_io_opts *);
#include "rebalance.h"
static inline struct bch_extent_rebalance
bch2_inode_rebalance_opts_get(struct bch_fs *c, struct bch_inode_unpacked *inode)
{
struct bch_io_opts io_opts;
bch2_inode_opts_get(&io_opts, c, inode);
return io_opts_to_rebalance_opts(&io_opts);
return io_opts_to_rebalance_opts(c, &io_opts);
}
int bch2_inode_rm_snapshot(struct btree_trans *, u64, u32);

fs/bcachefs/journal.c
@@ -319,6 +319,16 @@ void bch2_journal_halt(struct journal *j)
spin_unlock(&j->lock);
}
void bch2_journal_halt_locked(struct journal *j)
{
lockdep_assert_held(&j->lock);
__journal_entry_close(j, JOURNAL_ENTRY_ERROR_VAL, true);
if (!j->err_seq)
j->err_seq = journal_cur_seq(j);
journal_wake(j);
}
static bool journal_entry_want_write(struct journal *j)
{
bool ret = !journal_entry_is_open(j) ||
@@ -381,9 +391,12 @@ static int journal_entry_open(struct journal *j)
if (nr_unwritten_journal_entries(j) == ARRAY_SIZE(j->buf))
return JOURNAL_ERR_max_in_flight;
if (bch2_fs_fatal_err_on(journal_cur_seq(j) >= JOURNAL_SEQ_MAX,
c, "cannot start: journal seq overflow"))
if (journal_cur_seq(j) >= JOURNAL_SEQ_MAX) {
bch_err(c, "cannot start: journal seq overflow");
if (bch2_fs_emergency_read_only_locked(c))
bch_err(c, "fatal error - emergency read only");
return JOURNAL_ERR_insufficient_devices; /* -EROFS */
}
BUG_ON(!j->cur_entry_sectors);
@@ -783,6 +796,7 @@ recheck_need_open:
}
buf->must_flush = true;
j->flushing_seq = max(j->flushing_seq, seq);
if (parent && !closure_wait(&buf->wait, parent))
BUG();
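
Together with the super.c hunk further down, this is the deadlock fix:
journal_entry_open() runs under j->lock, so it must not call paths that take
j->lock again - hence the new bch2_journal_halt_locked() /
bch2_fs_emergency_read_only_locked() variants. A generic, runnable
illustration of the _locked-variant pattern (a pthread mutex stands in for the
kernel lock; the names are invented):

  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static int halted;

  /*
   * Variant for callers that already hold the lock - having them call the
   * plain variant below would self-deadlock, which is the bug class fixed
   * by adding bch2_journal_halt_locked():
   */
  static void halt_locked(void)
  {
          halted = 1;
  }

  static void halt(void)
  {
          pthread_mutex_lock(&lock);
          halt_locked();
          pthread_mutex_unlock(&lock);
  }

  int main(void)
  {
          pthread_mutex_lock(&lock);
          halt_locked();          /* inside a critical section: _locked variant */
          pthread_mutex_unlock(&lock);

          halt();                 /* outside: plain variant takes the lock itself */
          printf("halted = %d\n", halted);
          return 0;
  }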

fs/bcachefs/journal.h
@@ -409,6 +409,7 @@ bool bch2_journal_noflush_seq(struct journal *, u64, u64);
int bch2_journal_meta(struct journal *);
void bch2_journal_halt(struct journal *);
void bch2_journal_halt_locked(struct journal *);
static inline int bch2_journal_error(struct journal *j)
{

fs/bcachefs/journal_types.h
@@ -237,6 +237,7 @@ struct journal {
/* seq, last_seq from the most recent journal entry successfully written */
u64 seq_ondisk;
u64 flushed_seq_ondisk;
u64 flushing_seq;
u64 last_seq_ondisk;
u64 err_seq;
u64 last_empty_seq;

fs/bcachefs/opts.h
@@ -659,18 +659,4 @@ static inline void bch2_io_opts_fixups(struct bch_io_opts *opts)
struct bch_io_opts bch2_opts_to_inode_opts(struct bch_opts);
bool bch2_opt_is_inode_opt(enum bch_opt_id);
/* rebalance opts: */
static inline struct bch_extent_rebalance io_opts_to_rebalance_opts(struct bch_io_opts *opts)
{
return (struct bch_extent_rebalance) {
.type = BIT(BCH_EXTENT_ENTRY_rebalance),
#define x(_name) \
._name = opts->_name, \
._name##_from_inode = opts->_name##_from_inode,
BCH_REBALANCE_OPTS()
#undef x
};
};
#endif /* _BCACHEFS_OPTS_H */

fs/bcachefs/rebalance.c
@@ -121,12 +121,10 @@ u64 bch2_bkey_sectors_need_rebalance(struct bch_fs *c, struct bkey_s_c k)
}
}
incompressible:
if (opts->background_target &&
bch2_target_accepts_data(c, BCH_DATA_user, opts->background_target)) {
if (opts->background_target)
bkey_for_each_ptr_decode(k.k, ptrs, p, entry)
if (!p.ptr.cached && !bch2_dev_in_target(c, p.ptr.dev, opts->background_target))
sectors += p.crc.compressed_size;
}
return sectors;
}
@@ -140,7 +138,7 @@ static bool bch2_bkey_rebalance_needs_update(struct bch_fs *c, struct bch_io_opt
const struct bch_extent_rebalance *old = bch2_bkey_rebalance_opts(k);
if (k.k->type == KEY_TYPE_reflink_v || bch2_bkey_ptrs_need_rebalance(c, opts, k)) {
struct bch_extent_rebalance new = io_opts_to_rebalance_opts(opts);
struct bch_extent_rebalance new = io_opts_to_rebalance_opts(c, opts);
return old == NULL || memcmp(old, &new, sizeof(new));
} else {
return old != NULL;
@@ -163,7 +161,7 @@ int bch2_bkey_set_needs_rebalance(struct bch_fs *c, struct bch_io_opts *opts,
k.k->u64s += sizeof(*old) / sizeof(u64);
}
*old = io_opts_to_rebalance_opts(opts);
*old = io_opts_to_rebalance_opts(c, opts);
} else {
if (old)
extent_entry_drop(k, (union bch_extent_entry *) old);

fs/bcachefs/rebalance.h
@@ -4,8 +4,28 @@
#include "compress.h"
#include "disk_groups.h"
#include "opts.h"
#include "rebalance_types.h"
static inline struct bch_extent_rebalance io_opts_to_rebalance_opts(struct bch_fs *c,
struct bch_io_opts *opts)
{
struct bch_extent_rebalance r = {
.type = BIT(BCH_EXTENT_ENTRY_rebalance),
#define x(_name) \
._name = opts->_name, \
._name##_from_inode = opts->_name##_from_inode,
BCH_REBALANCE_OPTS()
#undef x
};
if (r.background_target &&
!bch2_target_accepts_data(c, BCH_DATA_user, r.background_target))
r.background_target = 0;
return r;
};
u64 bch2_bkey_sectors_need_rebalance(struct bch_fs *, struct bkey_s_c);
int bch2_bkey_set_needs_rebalance(struct bch_fs *, struct bch_io_opts *, struct bkey_i *);
int bch2_get_update_rebalance_opts(struct btree_trans *,
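
The design choice here: io_opts_to_rebalance_opts() now normalizes
background_target at construction time, clearing it when the target can't
accept data, so consumers such as bch2_bkey_sectors_need_rebalance() (see the
rebalance.c hunk above) can simply test the stored value. A sketch of that
normalize-at-construction pattern, with a stub standing in for
bch2_target_accepts_data():

  #include <stdbool.h>
  #include <stdio.h>

  /* Stub for "does this target accept user data?" - invented for illustration: */
  static bool target_accepts_data(unsigned target)
  {
          return target != 0 && target != 7;
  }

  struct rebalance_opts { unsigned background_target; };

  /*
   * Normalize at construction: an unusable background_target is stored as 0,
   * so every later consumer can trust the value without re-checking:
   */
  static struct rebalance_opts make_opts(unsigned background_target)
  {
          struct rebalance_opts r = { .background_target = background_target };

          if (r.background_target && !target_accepts_data(r.background_target))
                  r.background_target = 0;
          return r;
  }

  int main(void)
  {
          printf("%u\n", make_opts(3).background_target);  /* 3: usable target kept */
          printf("%u\n", make_opts(7).background_target);  /* 0: unusable target cleared */
          return 0;
  }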

fs/bcachefs/subvolume.c
@@ -428,7 +428,7 @@ static int __bch2_subvolume_delete(struct btree_trans *trans, u32 subvolid)
bch2_bkey_get_iter_typed(trans, &snapshot_iter,
BTREE_ID_snapshots, POS(0, snapid),
0, snapshot);
ret = bkey_err(subvol);
ret = bkey_err(snapshot);
bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOENT), trans->c,
"missing snapshot %u", snapid);
if (ret)
@@ -440,6 +440,11 @@ static int __bch2_subvolume_delete(struct btree_trans *trans, u32 subvolid)
bch2_bkey_get_iter_typed(trans, &snapshot_tree_iter,
BTREE_ID_snapshot_trees, POS(0, treeid),
0, snapshot_tree);
ret = bkey_err(snapshot_tree);
bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOENT), trans->c,
"missing snapshot tree %u", treeid);
if (ret)
goto err;
if (le32_to_cpu(snapshot_tree.v->master_subvol) == subvolid) {
struct bkey_i_snapshot_tree *snapshot_tree_mut =
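
The first subvolume.c hunk fixes a copy-and-paste bug: the error was read from
subvol, the previous lookup, instead of snapshot, the lookup just performed,
so a failed snapshot lookup went undetected. A stub illustration of the bug
class (not the bcachefs API):

  #include <errno.h>
  #include <stdio.h>

  /* Stand-in for a btree lookup result; bkey_err() extracts the error. */
  struct lookup { int err; };

  static int bkey_err_stub(struct lookup l) { return l.err; }

  int main(void)
  {
          struct lookup subvol   = { .err = 0 };        /* earlier lookup: ok */
          struct lookup snapshot = { .err = -ENOENT };  /* this lookup failed */

          int ret = bkey_err_stub(subvol);        /* BUG: wrong variable - failure missed */
          printf("buggy: ret = %d\n", ret);       /* 0 */

          ret = bkey_err_stub(snapshot);          /* fixed: check the lookup just made */
          printf("fixed: ret = %d\n", ret);       /* -2 (-ENOENT) */
          return 0;
  }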

fs/bcachefs/super.c
@@ -411,6 +411,17 @@ bool bch2_fs_emergency_read_only(struct bch_fs *c)
return ret;
}
bool bch2_fs_emergency_read_only_locked(struct bch_fs *c)
{
bool ret = !test_and_set_bit(BCH_FS_emergency_ro, &c->flags);
bch2_journal_halt_locked(&c->journal);
bch2_fs_read_only_async(c);
wake_up(&bch2_read_only_wait);
return ret;
}
static int bch2_fs_read_write_late(struct bch_fs *c)
{
int ret;

fs/bcachefs/super.h
@@ -29,6 +29,7 @@ int bch2_dev_resize(struct bch_fs *, struct bch_dev *, u64);
struct bch_dev *bch2_dev_lookup(struct bch_fs *, const char *);
bool bch2_fs_emergency_read_only(struct bch_fs *);
bool bch2_fs_emergency_read_only_locked(struct bch_fs *);
void bch2_fs_read_only(struct bch_fs *);
int bch2_fs_read_write(struct bch_fs *);

fs/bcachefs/trace.h
@@ -727,7 +727,7 @@ DEFINE_EVENT(fs_str, bucket_alloc_fail,
TP_ARGS(c, str)
);
TRACE_EVENT(discard_buckets,
DECLARE_EVENT_CLASS(discard_buckets_class,
TP_PROTO(struct bch_fs *c, u64 seen, u64 open,
u64 need_journal_commit, u64 discarded, const char *err),
TP_ARGS(c, seen, open, need_journal_commit, discarded, err),
@@ -759,6 +759,18 @@ TRACE_EVENT(discard_buckets,
__entry->err)
);
DEFINE_EVENT(discard_buckets_class, discard_buckets,
TP_PROTO(struct bch_fs *c, u64 seen, u64 open,
u64 need_journal_commit, u64 discarded, const char *err),
TP_ARGS(c, seen, open, need_journal_commit, discarded, err)
);
DEFINE_EVENT(discard_buckets_class, discard_buckets_fast,
TP_PROTO(struct bch_fs *c, u64 seen, u64 open,
u64 need_journal_commit, u64 discarded, const char *err),
TP_ARGS(c, seen, open, need_journal_commit, discarded, err)
);
TRACE_EVENT(bucket_invalidate,
TP_PROTO(struct bch_fs *c, unsigned dev, u64 bucket, u32 sectors),
TP_ARGS(c, dev, bucket, sectors),