.. SPDX-License-Identifier: GPL-2.0
.. _atomic_writes:

Atomic Block Writes
-------------------------

Introduction
~~~~~~~~~~~~

Atomic (untorn) block writes ensure that either the entire write is committed
to disk or none of it is. This prevents "torn writes" during power loss or
system crashes. The ext4 filesystem supports atomic writes (only with Direct
I/O) on regular files with extents, provided the underlying storage device
supports hardware atomic writes. This is supported in the following two ways:

1. **Single-fsblock Atomic Writes**:
   EXT4 has supported atomic write operations on a single filesystem block
   since v6.13. In this mode, the minimum and maximum atomic write unit sizes
   are both set to the filesystem block size. For example, a 16KB atomic write
   is possible with a 16KB filesystem block size on a system with a 64KB page
   size.

2. **Multi-fsblock Atomic Writes with Bigalloc**:
   EXT4 now also supports atomic writes spanning multiple filesystem blocks
   using a feature known as bigalloc. The atomic write unit's minimum and
   maximum sizes are determined by the filesystem block size and cluster size,
   within the underlying device's supported atomic write unit limits.

Requirements
~~~~~~~~~~~~

Basic requirements for atomic writes in ext4:

 1. The extents feature must be enabled (default for ext4)
 2. The underlying block device must support atomic writes
 3. For single-fsblock atomic writes:

    1. A filesystem with an appropriate block size (up to the page size)

 4. For multi-fsblock atomic writes:

    1. The bigalloc feature must be enabled
    2. The cluster size must be appropriately configured

NOTE: EXT4 does not support software or COW based atomic writes, which means
atomic writes on ext4 are only supported if the underlying storage device
supports them.

Multi-fsblock Implementation Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The bigalloc feature changes ext4 to allocate in units of multiple filesystem
blocks, also known as clusters. With bigalloc, each bit in the block bitmap
represents a cluster (a power-of-2 number of blocks) rather than an individual
filesystem block.
EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
following constraints. The minimum atomic write size is the larger of the
filesystem block size and the minimum hardware atomic write unit, and the
maximum atomic write size is the smaller of the bigalloc cluster size and the
maximum hardware atomic write unit. Bigalloc ensures that all allocations are
aligned to the cluster size, which satisfies the LBA alignment requirements of
the hardware device if the start of the partition/logical volume is itself
aligned correctly.
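
The limits described above can be sketched as a small calculation. This is
only an illustrative model, not the kernel code: the ``awu_limits`` structure
and the ``compute_awu_limits()`` helper are hypothetical names, all sizes are
assumed to be in bytes, and hardware limits of zero are assumed to mean that
the device does not support atomic writes.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical container for the resulting atomic write unit range. */
    struct awu_limits {
            uint32_t min;
            uint32_t max;
    };

    /*
     * min = larger of (fs block size, hardware minimum unit)
     * max = smaller of (bigalloc cluster size, hardware maximum unit)
     *
     * Returns false when the device reports no atomic write support or
     * when the resulting range is empty.
     */
    static bool compute_awu_limits(uint32_t fs_block_size, uint32_t cluster_size,
                                   uint32_t hw_min, uint32_t hw_max,
                                   struct awu_limits *out)
    {
            if (hw_min == 0 || hw_max == 0)
                    return false;

            out->min = fs_block_size > hw_min ? fs_block_size : hw_min;
            out->max = cluster_size < hw_max ? cluster_size : hw_max;

            return out->min <= out->max;
    }

For instance, with a 4KB block size, a 64KB cluster size and a device
advertising atomic write units from 512 bytes to 32KB, this model yields the
range [4KB, 32KB].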

Here is the block allocation strategy in bigalloc for atomic writes:

 * For regions with fully mapped extents, no additional work is needed
 * For append writes, a new mapped extent is allocated
 * For regions that are entirely holes, an unwritten extent is created
 * For large unwritten extents, the extent gets split into two unwritten
   extents of the appropriate requested size
 * For mixed mapping regions (combinations of holes, unwritten extents, or
   mapped extents), ``ext4_map_blocks()`` is called in a loop with the
   ``EXT4_GET_BLOCKS_ZERO`` flag to convert the region into a single contiguous
   mapped extent by writing zeroes to it and converting any unwritten extents
   found within the range to written.
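
The case analysis above can be pictured with a toy model that tracks one state
per filesystem block in the requested range. Everything in this sketch is
hypothetical (the ``toy_*`` names are not ext4 identifiers). The real code
works on extents and produces a physically contiguous extent, which this
simplified model does not represent.

.. code-block:: c

    #include <stddef.h>

    /* Toy per-block mapping state; not an ext4 data structure. */
    enum toy_blk_state { TOY_HOLE, TOY_UNWRITTEN, TOY_MAPPED };

    enum toy_range_kind {
            TOY_RANGE_EMPTY,        /* zero-length range */
            TOY_RANGE_HOLE,         /* entirely a hole */
            TOY_RANGE_UNWRITTEN,    /* entirely unwritten */
            TOY_RANGE_MAPPED,       /* entirely mapped (written) */
            TOY_RANGE_MIXED,        /* any combination of the above */
    };

    /* Classify a range by checking whether all blocks share one state. */
    static enum toy_range_kind toy_classify_range(const enum toy_blk_state *blks,
                                                  size_t nr)
    {
            size_t i;

            if (nr == 0)
                    return TOY_RANGE_EMPTY;

            for (i = 1; i < nr; i++)
                    if (blks[i] != blks[0])
                            return TOY_RANGE_MIXED;

            switch (blks[0]) {
            case TOY_HOLE:
                    return TOY_RANGE_HOLE;
            case TOY_UNWRITTEN:
                    return TOY_RANGE_UNWRITTEN;
            default:
                    return TOY_RANGE_MAPPED;
            }
    }

    /*
     * Model of the mixed-mapping fallback: after zeroing, every block in
     * the range is treated as mapped, so the range is no longer mixed.
     */
    static void toy_zero_and_map(enum toy_blk_state *blks, size_t nr)
    {
            size_t i;

            for (i = 0; i < nr; i++)
                    blks[i] = TOY_MAPPED;
    }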

Note: Writing on a single contiguous underlying extent, whether mapped or
unwritten, is not inherently problematic. However, writing to a mixed mapping
region (i.e. one containing a combination of mapped and unwritten extents)
must be avoided when performing atomic writes.

The reason is that an atomic write, when issued via ``pwritev2()`` with the
``RWF_ATOMIC`` flag, requires that either all of the data is written or none
of it is. In the event of a system crash or unexpected power loss during the
write operation, the affected region (when later read) must reflect either the
complete old data or the complete new data, but never a mix of both.

To enforce this guarantee, we ensure that the write target is backed by
a single, contiguous extent before any data is written. This is critical because
ext4 defers the conversion of unwritten extents to written extents until the I/O
completion path (typically in ``->end_io()``). If a write is allowed to proceed
over a mixed mapping region (with mapped and unwritten extents) and a failure
occurs mid-write, the system could observe partially updated regions after
reboot, i.e. new data over mapped areas and stale (old) data over unwritten
extents that were never marked written. This violates the atomicity guarantee,
that is, the prevention of torn writes.

To prevent such torn writes, ext4 proactively allocates a single contiguous
extent for the entire requested region in ``ext4_iomap_alloc()`` via
``ext4_map_blocks_atomic()``. EXT4 also forces a commit of the current journal
transaction if the allocation is done over a mixed mapping. This ensures that
any pending metadata updates (such as unwritten-to-written extent conversions)
in this range are in a consistent state with the file data blocks before the
actual write I/O is performed. If the commit fails, the whole I/O must be
aborted to prevent any possible torn writes.
Only after this step is the actual data write operation performed by iomap.

Handling Split Extents Across Leaf Blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a special edge case in which logically and physically contiguous
extents are stored in separate leaf nodes of the on-disk extent tree.
This occurs because on-disk extent tree merges only happen within a leaf
block, except for the case where a 2-level tree can be merged and collapsed
entirely into the inode.
If such a layout exists and, in the worst case, the extent status cache entries
are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
a single contiguous extent for these split leaf extents.

To address this edge case, a new get block flag,
``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
``ext4_map_query_blocks()`` lookup behavior.

This new get block flag allows ``ext4_map_blocks()`` to first check if there is
an entry in the extent status cache for the full range.
If not present, it consults the on-disk extent tree using
``ext4_map_query_blocks()``.
If the located extent is at the end of a leaf node, it probes the next logical
block (lblk) to detect a contiguous extent in the adjacent leaf.

For now only one additional leaf block is queried to maintain efficiency, as
atomic writes are typically constrained to small sizes
(e.g. [blocksize, clustersize]).
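
The probing step can be illustrated with a simplified model. The structure and
helpers below are hypothetical and only mimic the idea. They assume that two
extents count as one contiguous run when they are adjacent both logically and
physically and share the same written/unwritten state, and that at most one
extent from the following leaf is considered.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Toy extent descriptor; not the on-disk ext4 extent format. */
    struct toy_extent {
            uint32_t lblk;          /* first logical block */
            uint64_t pblk;          /* first physical block */
            uint32_t len;           /* length in blocks */
            bool unwritten;         /* true for an unwritten extent */
    };

    /* True if b directly follows a, logically and physically. */
    static bool toy_extents_contiguous(const struct toy_extent *a,
                                       const struct toy_extent *b)
    {
            return (uint64_t)a->lblk + a->len == b->lblk &&
                   a->pblk + a->len == b->pblk &&
                   a->unwritten == b->unwritten;
    }

    /*
     * Number of blocks available as one contiguous run starting at lblk,
     * capped at want. "found" is the extent containing lblk and "next" is
     * the first extent of the following leaf (NULL if there is none).
     */
    static uint32_t toy_query_len(const struct toy_extent *found,
                                  const struct toy_extent *next,
                                  uint32_t lblk, uint32_t want)
    {
            uint64_t avail;

            if (lblk < found->lblk || lblk - found->lblk >= found->len)
                    return 0;

            avail = found->len - (lblk - found->lblk);

            /* Probe only one extent in the adjacent leaf. */
            if (avail < want && next && toy_extents_contiguous(found, next))
                    avail += next->len;

            return avail < want ? (uint32_t)avail : want;
    }

In this model, a request for 8 blocks starting 4 blocks before the end of a
leaf's last extent is satisfied in full only when the first extent of the next
leaf continues it. Otherwise only 4 blocks are reported.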


Handling Journal transactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To support multi-fsblock atomic writes, we ensure enough journal credits are
reserved during:

 1. Block allocation time in ``ext4_iomap_alloc()``. We first query whether
    there could be a mixed mapping for the underlying requested range. If so,
    we reserve credits of up to ``m_len``, assuming every alternate block can
    be an unwritten extent followed by a hole.

 2. During the ``->end_io()`` call, we make sure a single transaction is
    started for the unwritten-to-written conversion. The conversion loop is
    mainly required to handle a split extent across leaf blocks.

How to
~~~~~~

Creating Filesystems with Atomic Write Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First check the atomic write units supported by the block device.
See :ref:`atomic_write_bdev_support` for more details.

For single-fsblock atomic writes with a larger block size
(on systems with block size < page size):

.. code-block:: bash

    # Create an ext4 filesystem with a 16KB block size
    # (requires page size >= 16KB)
    mkfs.ext4 -b 16384 /dev/device

For multi-fsblock atomic writes with bigalloc:

.. code-block:: bash

    # Create an ext4 filesystem with bigalloc and 64KB cluster size
    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device

Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
and ``-O bigalloc`` enables the bigalloc feature.

Application Interface
^^^^^^^^^^^^^^^^^^^^^

Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
to perform atomic writes:

.. code-block:: c

    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);

The write must be aligned to the filesystem's block size and not exceed the
filesystem's maximum atomic write unit size.
See ``generic_atomic_write_valid()`` for more details.
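
A minimal sketch of how an application might pre-check and issue such a write
is shown below. The helper names are hypothetical. The check assumes, based on
a reading of ``generic_atomic_write_valid()``, that the length must be a power
of two and that the offset must be a multiple of the length. The fallback
``RWF_ATOMIC`` definition is only used if the installed headers do not provide
it. ``unit_min`` and ``unit_max`` stand for the limits reported for the file.

.. code-block:: c

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE             /* for pwritev2() */
    #endif
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    #ifndef RWF_ATOMIC
    #define RWF_ATOMIC 0x00000040   /* assumed value from uapi linux/fs.h */
    #endif

    /* True if len is a power of two. */
    static bool is_pow2(uint64_t v)
    {
            return v != 0 && (v & (v - 1)) == 0;
    }

    /*
     * User-space pre-check: power-of-two length within [unit_min, unit_max]
     * and an offset that is naturally aligned to the length.
     */
    static bool atomic_write_args_ok(off_t offset, size_t len,
                                     uint32_t unit_min, uint32_t unit_max)
    {
            if (offset < 0 || !is_pow2(len))
                    return false;
            if (len < unit_min || len > unit_max)
                    return false;
            return (uint64_t)offset % len == 0;
    }

    /* Issue one atomic write from a single buffer (one iovec segment). */
    static ssize_t atomic_pwrite(int fd, void *buf, size_t len, off_t offset)
    {
            struct iovec iov = { .iov_base = buf, .iov_len = len };

            return pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);
    }

Since atomic writes are only supported with Direct I/O, a caller would
typically open the file with ``O_DIRECT`` and pass a suitably aligned buffer.
The kernel still performs its own validation, so the pre-check only helps to
reject bad requests early.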

The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
following details:

 * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
 * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
 * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
   separate memory buffers that can be gathered into a write operation
   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.

The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``stx_attributes`` is set if atomic
writes are supported.
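
The query could look roughly like the following sketch. The ``awu_info``
structure and the ``query_atomic_write_limits()`` helper are hypothetical. The
code assumes that the installed C library and kernel headers are recent enough
to define ``STATX_WRITE_ATOMIC`` and the corresponding ``struct statx`` fields,
and otherwise simply reports failure.

.. code-block:: c

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE             /* for statx() */
    #endif
    #include <fcntl.h>              /* AT_FDCWD */
    #include <stdint.h>
    #include <string.h>
    #include <sys/stat.h>

    /* Hypothetical container for the values reported by statx(). */
    struct awu_info {
            uint32_t unit_min;
            uint32_t unit_max;
            uint32_t segments_max;
    };

    /*
     * Returns 0 and fills *out if the file supports atomic writes,
     * -1 otherwise (including when the headers lack the definitions).
     */
    static int query_atomic_write_limits(const char *path, struct awu_info *out)
    {
    #if defined(STATX_WRITE_ATOMIC) && defined(STATX_ATTR_WRITE_ATOMIC)
            struct statx stx;

            memset(&stx, 0, sizeof(stx));
            if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) != 0)
                    return -1;
            /* The kernel or filesystem may not have filled in the fields. */
            if (!(stx.stx_mask & STATX_WRITE_ATOMIC))
                    return -1;
            if (!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC))
                    return -1;

            out->unit_min = stx.stx_atomic_write_unit_min;
            out->unit_max = stx.stx_atomic_write_unit_max;
            out->segments_max = stx.stx_atomic_write_segments_max;
            return 0;
    #else
            (void)path;
            (void)out;
            return -1;
    #endif
    }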

.. _atomic_write_bdev_support:

Hardware Support
~~~~~~~~~~~~~~~~

The underlying storage device must support atomic write operations.
Modern NVMe and SCSI devices often provide this capability.
The Linux kernel exposes this information through sysfs:

* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size

Nonzero values for these attributes indicate that the device supports
atomic writes.
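
As a rough sketch, a program could read these two attributes as shown below.
The helper names are hypothetical, and ``queue_dir`` is assumed to be the
device's queue directory, e.g. ``/sys/block/<device>/queue``.

.. code-block:: c

    #include <stdio.h>

    /* Read one unsigned decimal value from a sysfs-style attribute file. */
    static int read_sysfs_uint(const char *path, unsigned long long *val)
    {
            FILE *f = fopen(path, "r");
            int ok;

            if (!f)
                    return -1;
            ok = (fscanf(f, "%llu", val) == 1);
            fclose(f);
            return ok ? 0 : -1;
    }

    /*
     * Returns 1 if both limits are nonzero (atomic writes supported),
     * 0 if either is zero, and -1 if the attributes cannot be read.
     */
    static int bdev_atomic_write_limits(const char *queue_dir,
                                        unsigned long long *unit_min,
                                        unsigned long long *unit_max)
    {
            char path[4096];
            int n;

            n = snprintf(path, sizeof(path), "%s/atomic_write_unit_min",
                         queue_dir);
            if (n < 0 || (size_t)n >= sizeof(path) ||
                read_sysfs_uint(path, unit_min) != 0)
                    return -1;

            n = snprintf(path, sizeof(path), "%s/atomic_write_unit_max",
                         queue_dir);
            if (n < 0 || (size_t)n >= sizeof(path) ||
                read_sysfs_uint(path, unit_max) != 0)
                    return -1;

            return (*unit_min > 0 && *unit_max > 0) ? 1 : 0;
    }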

See Also
~~~~~~~~

* :doc:`bigalloc` - Documentation on the bigalloc feature
* :doc:`allocators` - Documentation on block allocation in ext4
* Support for atomic block writes in 6.13:
  https://lwn.net/Articles/1009298/