mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-08-05 16:54:27 +00:00

Last three sections of atomic block writes documentation are adorned as
first-level title headings, which erroneously increase toctree entries
in overview.rst. Demote them.
Fixes: 0bf1f51e34
("ext4: Add atomic block write documentation")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250620105643.25141-5-bagasdotme@gmail.com
225 lines
9.6 KiB
ReStructuredText
225 lines
9.6 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0
|
||
.. _atomic_writes:
|
||
|
||
Atomic Block Writes
|
||
-------------------------
|
||
|
||
Introduction
|
||
~~~~~~~~~~~~
|
||
|
||
Atomic (untorn) block writes ensure that either the entire write is committed
|
||
to disk or none of it is. This prevents "torn writes" during power loss or
|
||
system crashes. The ext4 filesystem supports atomic writes (only with Direct
|
||
I/O) on regular files with extents, provided the underlying storage device
|
||
supports hardware atomic writes. This is supported in the following two ways:
|
||
|
||
1. **Single-fsblock Atomic Writes**:
|
||
EXT4's supports atomic write operations with a single filesystem block since
|
||
v6.13. In this the atomic write unit minimum and maximum sizes are both set
|
||
to filesystem blocksize.
|
||
e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
|
||
pagesize system is possible.
|
||
|
||
2. **Multi-fsblock Atomic Writes with Bigalloc**:
|
||
EXT4 now also supports atomic writes spanning multiple filesystem blocks
|
||
using a feature known as bigalloc. The atomic write unit's minimum and
|
||
maximum sizes are determined by the filesystem block size and cluster size,
|
||
based on the underlying device’s supported atomic write unit limits.
|
||
|
||
Requirements
|
||
~~~~~~~~~~~~
|
||
|
||
Basic requirements for atomic writes in ext4:
|
||
|
||
1. The extents feature must be enabled (default for ext4)
|
||
2. The underlying block device must support atomic writes
|
||
3. For single-fsblock atomic writes:
|
||
|
||
1. A filesystem with appropriate block size (up to the page size)
|
||
4. For multi-fsblock atomic writes:
|
||
|
||
1. The bigalloc feature must be enabled
|
||
2. The cluster size must be appropriately configured
|
||
|
||
NOTE: EXT4 does not support software or COW based atomic write, which means
|
||
atomic writes on ext4 are only supported if underlying storage device supports
|
||
it.
|
||
|
||
Multi-fsblock Implementation Details
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
The bigalloc feature changes ext4 to allocate in units of multiple filesystem
|
||
blocks, also known as clusters. With bigalloc each bit within block bitmap
|
||
represents cluster (power of 2 number of blocks) rather than individual
|
||
filesystem blocks.
|
||
EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
|
||
following constraints. The minimum atomic write size is the larger of the fs
|
||
block size and the minimum hardware atomic write unit; and the maximum atomic
|
||
write size is smaller of the bigalloc cluster size and the maximum hardware
|
||
atomic write unit. Bigalloc ensures that all allocations are aligned to the
|
||
cluster size, which satisfies the LBA alignment requirements of the hardware
|
||
device if the start of the partition/logical volume is itself aligned correctly.
|
||
|
||
Here is the block allocation strategy in bigalloc for atomic writes:
|
||
|
||
* For regions with fully mapped extents, no additional work is needed
|
||
* For append writes, a new mapped extent is allocated
|
||
* For regions that are entirely holes, unwritten extent is created
|
||
* For large unwritten extents, the extent gets split into two unwritten
|
||
extents of appropriate requested size
|
||
* For mixed mapping regions (combinations of holes, unwritten extents, or
|
||
mapped extents), ext4_map_blocks() is called in a loop with
|
||
EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
|
||
mapped extent by writing zeroes to it and converting any unwritten extents to
|
||
written, if found within the range.
|
||
|
||
Note: Writing on a single contiguous underlying extent, whether mapped or
|
||
unwritten, is not inherently problematic. However, writing to a mixed mapping
|
||
region (i.e. one containing a combination of mapped and unwritten extents)
|
||
must be avoided when performing atomic writes.
|
||
|
||
The reason is that, atomic writes when issued via pwritev2() with the RWF_ATOMIC
|
||
flag, requires that either all data is written or none at all. In the event of
|
||
a system crash or unexpected power loss during the write operation, the affected
|
||
region (when later read) must reflect either the complete old data or the
|
||
complete new data, but never a mix of both.
|
||
|
||
To enforce this guarantee, we ensure that the write target is backed by
|
||
a single, contiguous extent before any data is written. This is critical because
|
||
ext4 defers the conversion of unwritten extents to written extents until the I/O
|
||
completion path (typically in ->end_io()). If a write is allowed to proceed over
|
||
a mixed mapping region (with mapped and unwritten extents) and a failure occurs
|
||
mid-write, the system could observe partially updated regions after reboot, i.e.
|
||
new data over mapped areas, and stale (old) data over unwritten extents that
|
||
were never marked written. This violates the atomicity and/or torn write
|
||
prevention guarantee.
|
||
|
||
To prevent such torn writes, ext4 proactively allocates a single contiguous
|
||
extent for the entire requested region in ``ext4_iomap_alloc`` via
|
||
``ext4_map_blocks_atomic()``. EXT4 also force commits the current journalling
|
||
transaction in case if allocation is done over mixed mapping. This ensures any
|
||
pending metadata updates (like unwritten to written extents conversion) in this
|
||
range are in consistent state with the file data blocks, before performing the
|
||
actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
|
||
from any possible torn writes.
|
||
Only after this step, the actual data write operation is performed by the iomap.
|
||
|
||
Handling Split Extents Across Leaf Blocks
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
There can be a special edge case where we have logically and physically
|
||
contiguous extents stored in separate leaf nodes of the on-disk extent tree.
|
||
This occurs because on-disk extent tree merges only happens within the leaf
|
||
blocks except for a case where we have 2-level tree which can get merged and
|
||
collapsed entirely into the inode.
|
||
If such a layout exists and, in the worst case, the extent status cache entries
|
||
are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
|
||
a single contiguous extent for these split leaf extents.
|
||
|
||
To address this edge case, a new get block flag
|
||
``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS flag`` is added to enhance the
|
||
``ext4_map_query_blocks()`` lookup behavior.
|
||
|
||
This new get block flag allows ``ext4_map_blocks()`` to first check if there is
|
||
an entry in the extent status cache for the full range.
|
||
If not present, it consults the on-disk extent tree using
|
||
``ext4_map_query_blocks()``.
|
||
If the located extent is at the end of a leaf node, it probes the next logical
|
||
block (lblk) to detect a contiguous extent in the adjacent leaf.
|
||
|
||
For now only one additional leaf block is queried to maintain efficiency, as
|
||
atomic writes are typically constrained to small sizes
|
||
(e.g. [blocksize, clustersize]).
|
||
|
||
|
||
Handling Journal transactions
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
To support multi-fsblock atomic writes, we ensure enough journal credits are
|
||
reserved during:
|
||
|
||
1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
|
||
could be a mixed mapping for the underlying requested range. If yes, then we
|
||
reserve credits of up to ``m_len``, assuming every alternate block can be
|
||
an unwritten extent followed by a hole.
|
||
|
||
2. During ``->end_io()`` call, we make sure a single transaction is started for
|
||
doing unwritten-to-written conversion. The loop for conversion is mainly
|
||
only required to handle a split extent across leaf blocks.
|
||
|
||
How to
|
||
~~~~~~
|
||
|
||
Creating Filesystems with Atomic Write Support
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
First check the atomic write units supported by block device.
|
||
See :ref:`atomic_write_bdev_support` for more details.
|
||
|
||
For single-fsblock atomic writes with a larger block size
|
||
(on systems with block size < page size):
|
||
|
||
.. code-block:: bash
|
||
|
||
# Create an ext4 filesystem with a 16KB block size
|
||
# (requires page size >= 16KB)
|
||
mkfs.ext4 -b 16384 /dev/device
|
||
|
||
For multi-fsblock atomic writes with bigalloc:
|
||
|
||
.. code-block:: bash
|
||
|
||
# Create an ext4 filesystem with bigalloc and 64KB cluster size
|
||
mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
|
||
|
||
Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
|
||
and ``-O bigalloc`` enables the bigalloc feature.
|
||
|
||
Application Interface
|
||
^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
|
||
to perform atomic writes:
|
||
|
||
.. code-block:: c
|
||
|
||
pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
|
||
|
||
The write must be aligned to the filesystem's block size and not exceed the
|
||
filesystem's maximum atomic write unit size.
|
||
See ``generic_atomic_write_valid()`` for more details.
|
||
|
||
``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following
|
||
details:
|
||
|
||
* ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
|
||
* ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
|
||
* ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
|
||
separate memory buffers that can be gathered into a write operation
|
||
(e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
|
||
|
||
The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
|
||
writes are supported.
|
||
|
||
.. _atomic_write_bdev_support:
|
||
|
||
Hardware Support
|
||
~~~~~~~~~~~~~~~~
|
||
|
||
The underlying storage device must support atomic write operations.
|
||
Modern NVMe and SCSI devices often provide this capability.
|
||
The Linux kernel exposes this information through sysfs:
|
||
|
||
* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
|
||
* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
|
||
|
||
Nonzero values for these attributes indicate that the device supports
|
||
atomic writes.
|
||
|
||
See Also
|
||
~~~~~~~~
|
||
|
||
* :doc:`bigalloc` - Documentation on the bigalloc feature
|
||
* :doc:`allocators` - Documentation on block allocation in ext4
|
||
* Support for atomic block writes in 6.13:
|
||
https://lwn.net/Articles/1009298/
|