Submitting patches to bcachefs
==============================

Here are suggestions for submitting patches to the bcachefs subsystem.

Submission checklist
--------------------

Patches must be tested before being submitted, either with the xfstests suite
[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
touched. Note that ktest wraps xfstests and will be an easier way to run it
for most users; it includes single-command wrappers for all the mainstream
in-kernel local filesystems.

Patches will undergo more testing after being merged (including
lockdep/kasan/preempt/etc. variants); submitters are not generally required
to run these. But do put some thought into what you're changing and which
tests might be relevant: if you're dealing with tricky memory layout work,
run the kasan variant; if you're doing locking work, run lockdep. ktest
includes single-command variants for the debug build types you'll most likely
need.

The exception to this rule is incomplete WIP/RFC patches: if you're working on
something nontrivial, it's encouraged to send out a WIP patch to let people
know what you're doing and make sure you're on the right track. Just make sure
it includes a brief note as to what's done and what's incomplete, to avoid
confusion.

Rigorous checkpatch.pl adherence is not required (many of its warnings are
considered out of date), but try not to deviate too much without reason.

Focus on writing code that reads well and is organized well; code should be
aesthetically pleasing.

CI
--

When running the full test suite, it's preferable to let a server farm do it
in parallel rather than running the tests locally, and to view the results in
a test dashboard (which can tell you which failures are new, and presents
results in a git log view, avoiding the need for most bisecting).

Such a CI exists [2]_, and community members may request an account. If you
work for a big tech company, you'll need to help out with server costs to get
access - but the CI is not restricted to running bcachefs tests: it runs any
ktest test (which generally makes it easy to wrap other tests that can run in
qemu).

Other things to think about
---------------------------

- How will we debug this code? Is there sufficient introspection to diagnose
  when something starts acting wonky on a user machine?

  We don't necessarily need every single field of every data structure visible
  with introspection, but having the important fields of all the core data
  types wired up makes debugging drastically easier - a bit of thoughtful
  foresight greatly reduces the need to have people build custom kernels with
  debug patches.

  More broadly, think about all the debug tooling that might be needed.

- Does it make the codebase more or less of a mess? Can we try to do some
  organizing, too?

- Do new tests need to be written? New assertions? How do we know and verify
  that the code is correct, and what happens if something goes wrong?

  We don't yet have automated code coverage analysis or easy fault injection -
  but for now, pretend we did and ask what they might tell us.

  Assertions are hugely important, given that we don't yet have a systems
  language that can do ergonomic embedded correctness proofs. Hitting an assert
  in testing is much better than wandering off into undefined behaviour la-la
  land - use them. Use them judiciously, and not as a replacement for proper
  error handling, but use them.

- Does it need to be performance tested? Should we add new performance counters?

  bcachefs has a set of persistent runtime counters which can be viewed with
  the 'bcachefs fs top' command; this should give users a basic idea of what
  their filesystem is currently doing. If you're doing a new feature or looking
  at old code, consider whether any counters should be added.

- If it's a new on disk format feature - have upgrades and downgrades been
  tested? (Automated tests exist but aren't in the CI, due to the hassle of
  disk image management; coordinate to have them run.)
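
The points above about introspection, counters and assertions can be
illustrated with a minimal sketch in plain userspace C. Every name here
(``demo_alloc``, ``demo_alloc_bucket()``, ``demo_alloc_to_text()``, the
``DEMO_COUNTER_*`` values) is hypothetical and chosen for illustration; this is
not bcachefs code, and in-kernel code would use the kernel's own assertion and
printing helpers rather than ``assert()`` and ``snprintf()``.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical event counters; names are illustrative, not bcachefs's. */
enum demo_counter {
	DEMO_COUNTER_alloc_ok,
	DEMO_COUNTER_alloc_fail,
	DEMO_COUNTER_NR,
};

/* Hypothetical allocator state, standing in for a core data type. */
struct demo_alloc {
	unsigned	nr_buckets;
	unsigned	nr_free;
	unsigned long	counters[DEMO_COUNTER_NR];
};

/* Introspection: print the important fields so state can be inspected. */
int demo_alloc_to_text(char *buf, size_t len, const struct demo_alloc *a)
{
	return snprintf(buf, len,
			"buckets %u free %u alloc_ok %lu alloc_fail %lu",
			a->nr_buckets, a->nr_free,
			a->counters[DEMO_COUNTER_alloc_ok],
			a->counters[DEMO_COUNTER_alloc_fail]);
}

/*
 * Running out of free buckets is an expected runtime condition, so it is
 * reported with an error code. nr_free exceeding nr_buckets can only be
 * caused by a bug elsewhere, so it is checked with an assertion.
 */
int demo_alloc_bucket(struct demo_alloc *a)
{
	assert(a->nr_free <= a->nr_buckets);

	if (!a->nr_free) {
		a->counters[DEMO_COUNTER_alloc_fail]++;
		return -ENOSPC;
	}

	a->nr_free--;
	a->counters[DEMO_COUNTER_alloc_ok]++;
	return 0;
}
```

With one free bucket, a first call to ``demo_alloc_bucket()`` returns 0 and a
second returns ``-ENOSPC``; ``demo_alloc_to_text()`` then shows both counters
at 1, which is the kind of state you would want visible when diagnosing a
problem on a user's machine.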

Mailing list, IRC
-----------------

Patches should hit the list [3]_, but much discussion and code review happens
on IRC as well [4]_; many people appreciate the more conversational approach
and quicker feedback.

Additionally, we have a lively user community doing excellent QA work, which
exists primarily on IRC. Please make use of that resource; user feedback is
important for any nontrivial feature, and documenting it in commit messages
would be a good idea.

.. rubric:: References

.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
.. [1] https://evilpiepirate.org/git/ktest.git/
.. [2] https://evilpiepirate.org/~testdashboard/ci/
.. [3] linux-bcachefs@vger.kernel.org
.. [4] irc.oftc.net#bcache, #bcachefs-dev