Commit graph

247 commits

Author SHA1 Message Date
Kent Overstreet
4ba99dde33 bcachefs: BCH_INODE_has_case_insensitive
Add a flag for tracking whether a directory has case-insensitive
descendents - so that overlayfs can disallow mounting, even though the
filesystem supports case insensitivity.

This is a new on disk format version, with a (cheap) upgrade to ensure
the flag is correctly set on existing inodes.

Create, rename and fssetxattr are all plumbed to ensure the new flag is
set, and we've got new fsck code that hooks into check_inode(0.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:15:10 -04:00
Kent Overstreet
e21f997721 bcachefs: bch_sb_field_recovery_passes
New superblock section for statistics on recovery passes - last time
ran (successfully), last runtime.

This will be used by self healing code to determine when to kick off
potentially expensive recovery passes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:53 -04:00
Kent Overstreet
20a4b7f3b8 bcachefs: recovery_passes_types.h -> recovery_passes_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:52 -04:00
Kent Overstreet
a8539ad8fa bcachefs: bcachefs_metadata_version_fast_device_removal
Fast device removal, that uses backpointers to find pointers to the
device being removed instead of a full metadata scan.

This requires BCH_SB_MEMBER_DELETED_UUID, which is an incompatible
change - hence the version number bump. We don't fully trust
backpointers, so we don't want to reuse device indexes until after a
fsck has verified that there aren't any pointers to removed devices.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:49 -04:00
Kent Overstreet
96fc7d8adb bcachefs: opts.rebalance_on_ac_only
Add an option for setting rebalance to only run when connected to mains
power.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:48 -04:00
Kent Overstreet
e9756dd29f bcachefs: bcachefs_metadata_version_snapshot_deletion_v2
We're going to be speeding up snapshot deletion, by only having it
process the extents/dirents/xattrs btrees if an inode of a given
snapshot ID was present.

This raises the possibility of 'bkey_in_missing_snapshot' errors popping
up, if we ever accidentally don't do the corresponding inode update, or
if the new algorithm has bugs.

So instead of deleting snapshot IDs, add a new deleted flag, so that
'key in missing snapshot' errors can more definitively tell what
happened and automatically repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:45 -04:00
Kent Overstreet
530112d88e bcachefs: BCH_FEATURE_small_image
We can't go RW if it's an image file that hasn't been resized.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:20 -04:00
Kent Overstreet
203852d9db bcachefs: BCH_FEATURE_no_alloc_info
If a filesystem is going to only be used read-only, and will be a
deployable image, we can strip out alloc info for a substantial
reduction in metadata size - around half, due to backpointers.

Alloc info will be regenerated on first read-write mount.

Remounting RW is disallowed for now, since we don't yet have
check_allocations running in RW mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:20 -04:00
Kent Overstreet
c02e5b5728 bcachefs: Single device mode
Single device filesystems are now identified by the block device name,
not the UUID - and single device filesystems with the same UUID can be
mounted simultaneously, without any special options.

This allocates a new bit in the superblock, BCH_SB_MULTI_DEVICE, which
indicates whether a filesystem has ever been multi device.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:15 -04:00
Kent Overstreet
ef8dd631f7 bcachefs: Improve opts.degraded
Kill 'opts.very_degraded', and make 'opts.degraded' a persistent option,
stored in the superblock.

It's now an enum, with available choices ask/yes/very/no.

"ask" mode will be handled by the mount helper, for prompting the user
(on a machine used interactively) for whether to do a degraded mount.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:12 -04:00
Kent Overstreet
b9e1f873d2 bcachefs: Casefold is now a regular opts.h option
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:00 -04:00
Kent Overstreet
71f8e806a5 bcachefs: Stricter checks on "key allowed in this btree"
Syzbot managed to come up with a filesystem where check/repair got
rather confused at finding a reflink pointer in the inodes btree.

Currently, the "key allowed in this btree" checks only apply at commit
time, not read time - for forwards compatibility. It seems this is too
loose.

Now, strict key type allowed checks apply:
 - at commit time (no forward compatibility issues)
 - for btree node pointers
 - if it's a known btree, known key type, and the key type has the
   "BKEY_TYPE_strict_btree_checks" flag.

This means we still have the option of using generic key types - e.g.
KEY_TYPE_error, KEY_TYPE_set - on more existing btrees in the future,
while most key types that are intended for only a specific btree get
stricter checks.

Reported-by: syzbot+baee8591f336cab0958b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
edaed8ee8c bcachefs: BCH_JSET_ENTRY_log_bkey
Add a journal entry type for logging - but logging a bkey, not a string;
to be used for data move path debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 18:25:12 -04:00
Kent Overstreet
be31e412ac bcachefs: Checksum errors get additional retries
It's possible for checksum errors to be transient - e.g. flakey
controller or cable, thus we need additional retries (besides retrying
from different replicas) before we can definitely return an error.

This is particularly important for the next patch, which will allow the
data move path to move extents with checksum errors - we don't want to
accidentally introduce bitrot due to a transient error!

- bch2_bkey_pick_read_device() is substantially reworked, and
  bch2_dev_io_failures is expanded to record more information about the
  type of failure (i.e. number of checksum errors).

  It now returns an error code that describes more precisely the reason
  for the failure - checksum error, io error, or offline device, instead
  of the previous generic "insufficient devices". This is important for
  the next patches that add poisoning, as we only want to poison extents
  when we've got real checksum errors (or perhaps IO errors?) - not
  because a device was offline.

- Add a new option and superblock field for the number of checksum
  retries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
981e380144 bcachefs: Kick devices out after too many write IO errors
We're improving our handling of write errors - we shouldn't write
degraded data just because a write failed once, we should retry it (on
other devices, if possible).

But for this to work, we need to kick devices out when they're only
returning errors - otherwise those retries will loop infinitely.

This adds a configurable timeout - if writes are failing for too long,
we'll set that device read-only.

In the future we should also implement more tracking and another knob
for an "allowed error rate", so that we can kick out drives that are
acting "unhealthy".

Another thing we'll want is a mechanism (likely in userspace) for
bringing a device back in after a transient error - perhaps a cable was
jiggled, or there was a controller reset.

After transient errors we also need a mechanism to walk (from the
journal) recent btree updates that weren't flushed to that device and
treat them as "degraded", since unflushed data may well not have been
written. Out of scope for this patch, but becoming relevant.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
4a90675cfe bcachefs: bcachefs_metadata_version_extent_flags
This implements a new extent field bitflags that apply to the whole
extent. There's been a couple things we've wanted this for in the past,
but the immediate need is extent poisoning, to solve a rebalance issue.

Unknown extent fields can't be parsed (we won't known their size, so we
can't advance to the next field), so this is an incompat feature, and
using it prevents the filesystem from being mounted by old versions.

This also adds the BCH_EXTENT_poisoned flag; this indicates that the
data is known to be bad (i.e. there was a checksum error, and we had to
write a new checksum) and reads will return errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
fb195fa753 bcachefs: BCH_SB_FEATURES_ALL includes BCH_FEATURE_incompat_verison_field
These features are set on format and incompat upgarde.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Joshua Ashton
d37c14ac6f bcachefs: bcachefs_metadata_version_casefolding
This patch implements support for case-insensitive file name lookups
in bcachefs.

The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs is using.

More information is provided in Documentation/bcachefs/casefolding.rst

Compatibility notes:

This uses the new versioning scheme for incompatible features where an
incompatible feature is tied to a version number: the superblock says
"we may use incompat features up to x" and "incompat features up to x
are in use", disallowing mounting by previous versions.

Additionally, and old style incompat feature bit is used, so that
kernels without utf8 casefolding support know if casefolding
specifically is in use and they're allowed to mount.

Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
6756e385a5 bcachefs: bcachefs_metadata_version_stripe_lru
Add a persistent LRU for stripes, ordered by "number of empty blocks",
i.e. order in which we wish to reuse them.

This will replace the in-memory stripes heap, so we can kill off reading
stripes into memory at startup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
88d961b518 bcachefs: bcachefs_metadata_version_stripe_backpointers
Stripes now have backpointers.

This is needed for proper scrub - stripe checksums need to be verified,
separately from extents within the stripe, since a block may not be full
of live extents but it's still needed for reconstruct.

And this will be needed for (efficient) evacuate/repair paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
15800f3d4b bcachefs: bcachefs_metadata_version_cached_backpointers
Cached pointers now have backpointers.

This means that we'll be able to kill cached pointers in the
bucket_invalidate path, when invalidating/reusing buckets containing
cached data, instead of leaving them around to be cleaned up by gc_gens
garbago collection - which requires a full metadata scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Hongbo Li
c72deb03ff bcachefs: bcachefs_metadata_version_directory_size
This adds another metadata version for accounting directory size.
For the new version of the filesystem, when new subdirectory items
are created or deleted, the parent directory's size will change
accordingly. For the old version of the existed file system, running
fsck will automatically upgrade the metadata version, and it will
do the check and recalculationg of the directory size.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-13 14:58:38 -05:00
Kent Overstreet
cb3f34982c bcachefs: Assert that btree write buffer only touches the right btrees
More asserts, more better.

Also, clean up the per-btree flags a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
3db3084a86 bcachefs: bcachefs_metadata_version_autofix_errors
It's time to make self healing the default: change the error action for
old filesystems to fix_safe, matching the default for current
filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
df448ca355 bcachefs: bcachefs_metadata_version_persistent_inode_cursors
Persistent cursors for inode allocation.

A free inodes btree would add substantial overhead to inode allocation
and freeing - a "next num to allocate" cursor is always going to be
faster.

We just need it to be persistent, to avoid scanning the inodes btree
from the start on startup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
59c50511f7 bcachefs: bcachefs_metadata_version_inode_depth
This adds a new inode field, bi_depth, for directory inodes: this allows
us to make the check_directory_structure pass much more efficient.

Currently, to ensure the filesystem is fully connect and has no loops,
for every directory we follow backpointers until we find the root. But
by adding a depth counter, it sufficies to only check the parent of each
directory, and check that the parent's bi_depth is smaller.

(fsck doesn't require that bi_depth = parent->bi_depth + 1; if a rename
causes bi_depth off, but the chain to the root is still strictly
decreasing, then the algorithm still works and there's no need for fsck
to fixup the bi_depth fields).

We've already checked backpointers, so we know that every directory
(excluding the root)has a valid parent: if bi_depth is always
decreasing, every chain must terminate, and terminate at the root
directory.

bi_depth will not necessarily be correct when fsck runs, due to
directory renames - we can't change bi_depth on every child directory
when renaming a directory. That's ok; fsck will silently fix the
bi_depth field as needed, and future fsck runs will be much faster.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
ea4f9e75ec bcachefs: bcachefs_metadata_version_reflink_p_may_update_opts
Previously, io path option changes on a file would be picked up
automatically and applied to existing data - but not for reflinked data,
as we had no way of doing this safely. A user may have had permission to
copy (and reflink) a given file, but not write to it, and if so they
shouldn't be allowed to change e.g. nr_replicas or other options.

This uses the incompat feature mechanism in the previous patch to add a
new incompatible flag to bch_reflink_p, indicating whether a given
reflink pointer may propagate io path option changes back to the
indirect extent.

In this initial patch we're only setting it for the source extents.

We'd like to set it for the destination in a reflink copy, when the user
has write access to the source, but that requires mnt_idmap which is not
curretly plumbed up to remap_file_range.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
a36d8f0e0e bcachefs: BCH_SB_VERSION_INCOMPAT
We've been getting away from feature bits: they don't have any kind of
ordering, and thus it's possible for people to enable weird combinations
of features that were never tested or intended to be run.

Much better to just give every new feature, compatible or incompatible,
a version number.

Additionally, we probably won't ever rev the major version number: major
version numbers represent incompatible versions, but that doesn't really
fit with how we actually roll out incompatible features - we need a
better way of rolling out incompatible features.

So, this patch adds two new superblock fields:
- BCH_SB_VERSION_INCOMPAT
- BCH_SB_VERSION_INCOMPAT_ALLOWED

BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up
to version number x are allowed to be used without user prompting, but
it does not by itself deny old versions from mounting.

BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must
be <= BCH_SB_VERSION_INCOMPAT_ALLOWED.

BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use
an incompatible feature, so as to not unnecessarily break compatibility
with old versions.

bch2_request_incompat_feature() is the new interface to check if an
incompatible feature may be used.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
ba9752e5f4 bcachefs: bcachefs_metadata_version_disk_accounting_big_endian
Fix sort order for disk accounting keys, in order to fix a regression on
mount times.

The typetag is now the most significant byte of the key, meaning disk
accounting keys of the same type now sort together.

This lets us skip over disk accounting keys that aren't mirrored in
memory when reading accounting at startup, instead of having them
interleaved with other counter types.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
ebdca07268 bcachefs: bcachefs_metadata_version_backpointer_bucket_gen
New on disk format version: backpointers new include the generation
number of the bucket they refer to, and the obsolete bucket_offset field
(no longer needed because we no longer store backpointers in alloc keys)
is gone.

This is an expensive forced upgrade - hopefully the last; we have to run
the extents_to_backpointers recovery pass to regenerate backpointers.

It's a forced incompatible upgrade because the alternative would've been
permamently making backpointers bigger, and as one of the biggest btrees
(along with the extents btree) that's not an ideal option.

It's worth it though, because this allows us to make the
check_extents_to_backpointers pass drastically cheaper: an upcoming
patch changes it to sum up backpointers in a bucket and check the sum
against the sector counts for that bucket, only looking for missing
backpointers if they don't match (and then only for specific buckets).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
ed13bb5726 bcachefs: bch2_prt_csum_opt()
bounds checking helper

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
106480e9a8 bcachefs: Factor out jset_entry_log_msg_bytes()
Needed for improved userspace cmd_list_journal

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
26c79fdc58 bcachefs: lru, accounting are alloc btrees
They can be regenerated by fsck and don't require a btree node scan,
like other alloc btrees.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Thorsten Blum
03525de506 bcachefs: Remove duplicate included headers
The header files dirent_format.h and disk_groups_format.h are included
twice. Remove the redundant includes and the following warnings reported
by make includecheck:

  disk_groups_format.h is included more than once
  dirent_format.h is included more than once

Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
9b23fdbd5d bcachefs: bcachefs_metadata_version_inode_has_child_snapshots
There's an inherent race in taking a snapshot while an unlinked file is
open, and then reattaching it in the child snapshot.

In the interior snapshot node the file will appear unlinked, as though
it should be deleted - it's not referenced by anything in that snapshot
- but we can't delete it, because the file data is referenced by the
child snapshot.

This was being handled incorrectly with
propagate_key_to_snapshot_leaves() - but that doesn't resolve the
fundamental inconsistency of "this file looks like it should be deleted
according to normal rules, but - ".

To fix this, we need to fix the rule for when an inode is deleted. The
previous rule, ignoring snapshots (there was no well-defined rule
for with snapshots) was:
  Unlinked, non open files are deleted, either at recovery time or
  during online fsck

The new rule is:
  Unlinked, non open files, that do not exist in child snapshots, are
  deleted.

To make this work transactionally, we add a new inode flag,
BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when
considering whether to delete an inode, or put it on the deleted list.

For transactional consistency, clearing it handled by the inode trigger:
when deleting an inode we check if there are parent inodes which can now
have the BCH_INODE_has_child_snapshot flag cleared.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:51 -04:00
Guenter Roeck
2007d28ec0 bcachefs: rename version -> bversion for big endian builds
Builds on big endian systems fail as follows.

fs/bcachefs/bkey.h: In function 'bch2_bkey_format_add_key':
fs/bcachefs/bkey.h:557:41: error:
	'const struct bkey' has no member named 'bversion'

The original commit only renamed the variable for little endian builds.
Rename it for big endian builds as well to fix the problem.

Fixes: cf49f8a8c2 ("bcachefs: rename version -> bversion")
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-29 23:55:52 -04:00
Kent Overstreet
cf49f8a8c2 bcachefs: rename version -> bversion
give bversions a more distinct name, to aid in grepping

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
c7652f253a bcachefs: promote_whole_extents is now a normal option
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
49aa783039 bcachefs: Fix rebalance_work accounting
rebalance_work was keying off of the presence of rebelance_opts in the
extent - but that was incorrect, we keep those around after rebalance
for indirect extents since the inode's options are not directly
available

Fixes: 20ac515a9c ("bcachefs: bch_acct_rebalance_work")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-24 10:16:21 -04:00
Kent Overstreet
58474f76a7 bcachefs: bcachefs_metadata_version_disk_accounting_inum
This adds another disk accounting counter to track usage per inode
number (any snapshot ID).

This will be used for a couple things:

- It'll give us a way to tell the user how much space a given file ista
  consuming in all snapshots; i.e. how much extra space it's consuming
  due to snapshot versioning.

- It counts number of extents and total size of extents (both in btree
  keyspace sectors and actual disk usage), meaning it gives us average
  extent size: that is, it'll let us cheaply find fragmented files that
  should be defragmented.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
8a2491db7b bcachefs: bcachefs_metadata_version_disk_accounting_v3
bcachefs_metadata_version_disk_accounting_v2 erroneously had padding
bytes in disk_accounting_key, which is a problem because we have to
guarantee that all unused bytes in disk_accounting_key are zeroed.

Fortunately 6.11 isn't out yet, so it's cheap to fix this by spinning a
new version.

Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-09 19:21:28 -04:00
Kent Overstreet
cecf72798b bcachefs: Make allocator stuck timeout configurable, ratelimit messages
Limit these messages to once every 2 minutes to avoid spamming logs;
with multiple devices the output can be quite significant.

Also, up the default timeout to 30 seconds from 10 seconds.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 21:04:55 -04:00
Kent Overstreet
7aa7183e00 bcachefs: split out lru_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
1d16c605cc bcachefs: Disk space accounting rewrite
Main part of the disk accounting rewrite.

This is a wholesale rewrite of the existing disk space accounting, which
relies on percepu counters that are sharded by journal buffer, and
rolled up and added to each journal write.

With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.

Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
they're not strictly required.

An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
2744e5c9eb bcachefs: KEY_TYPE_accounting
New key type for the disk space accounting rewrite.

 - Holds a variable sized array of u64s (may be more than one for
   accounting e.g. compressed and uncompressed size, or buckets and
   sectors for a given data type)

 - Updates are deltas, not new versions of the key: this means updates
   to accounting can happen via the btree write buffer, which we'll be
   teaching to accumulate deltas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
7773df19c3 bcachefs: metadata version bucket_stripe_sectors
New on disk format version for bch_alloc->stripe_sectors and
BCH_DATA_unstriped - accounting for unstriped data in stripe buckets.

Upgrade/downgrade requires regenerating alloc info - but only if erasure
coding is in use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
2612e29142 bcachefs: BCH_DATA_unstriped
Add a new pseudo data type, to track buckets that are members of a
stripe, but have unstriped data in them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
33dfafa902 bcachefs: Fix safe errors by default
i.e. the start of automatic self healing:

If errors=continue or fix_safe, we now automatically fix simple errors
without user intervention.

New error action option: fix_safe

This replaces the existing errors=ro option, which gets a new slot, i.e.
existing errors=ro users now get errors=fix_safe.

This is currently only enabled for a limited set of errors - initially
just disk accounting; errors we would never not want to fix, and we
don't want to require user intervention (i.e. to make sure a bug report
gets filed).

Errors will still be counted in the superblock, so we (developers) will
still know they've been occuring if a bug report gets filed (as bug
reports typically include the errors superblock section).

Eventually we'll be enabling this for a much wider set of errors, after
we've done thorough error injection testing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-20 09:13:09 -04:00
Kent Overstreet
cff07e2739 bcachefs: Guard against overflowing LRU_TIME_BITS
LRUs only have 48 bits for the time field (i.e. LRU order); thus we need
overflow checks and guards.

Reported-by: syzbot+df3bf3f088dcaa728857@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:24 -04:00
Kent Overstreet
e3fd3faa45 bcachefs: Fix btree ID bitmasks
these should be 64 bit bitmasks, not 32 bit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:23 -04:00