.. SPDX-License-Identifier: GPL-2.0

===================================
Network Filesystem Services Library
===================================

.. Contents:

 - Overview.
   - Requests and streams.
   - Subrequests.
   - Result collection and retry.
   - Local caching.
   - Content encryption (fscrypt).
 - Per-inode context.
   - Inode context helper functions.
   - Inode locking.
   - Inode writeback.
 - High-level VFS API.
   - Unlocked read/write iter.
   - Pre-locked read/write iter.
   - Memory-mapped I/O API.
   - Monolithic files API.
 - High-level VM API.
   - Deprecated PG_private_2 API.
 - I/O request API.
   - Request structure.
   - Stream structure.
   - Subrequest structure.
   - Filesystem methods.
   - Terminating a subrequest.
   - Local cache API.
 - API function reference.

Overview
========

The network filesystem services library, netfslib, is a set of functions
designed to aid a network filesystem in implementing VM/VFS API operations. It
takes over the normal buffered read, readahead, write and writeback and also
handles unbuffered and direct I/O.

The library provides support for (re-)negotiation of I/O sizes and retrying
failed I/O as well as local caching and will, in the future, provide content
encryption.

It insulates the filesystem from VM interface changes as much as possible and
handles VM features such as large multipage folios. The filesystem basically
just has to provide a way to perform read and write RPC calls.

I/O inside netfslib is organised around a number of objects:

 * A *request*. A request is used to track the progress of the I/O overall and
   to hold on to resources. The collection of results is done at the request
   level. The I/O within a request is divided into a number of parallel
   streams of subrequests.

 * A *stream*. A non-overlapping series of subrequests. The subrequests
   within a stream do not have to be contiguous.

 * A *subrequest*. This is the basic unit of I/O. It represents a single RPC
   call or a single cache I/O operation. The library passes these to the
   filesystem and the cache to perform.

Requests and Streams
--------------------

When actually performing I/O (as opposed to just copying into the pagecache),
netfslib will create one or more requests to track the progress of the I/O and
to hold resources.

A read operation will have a single stream and the subrequests within that
stream may be of mixed origins, for instance mixing RPC subrequests and cache
subrequests.

On the other hand, a write operation may have multiple streams, where each
stream targets a different destination. For instance, there may be one stream
writing to the local cache and one to the server. Currently, only two streams
are allowed, but this could be increased if parallel writes to multiple servers
are desired.

The subrequests within a write stream do not need to match alignment or size
with the subrequests in another write stream and netfslib performs the tiling
of subrequests in each stream over the source buffer independently. Further,
each stream may contain holes that don't correspond to holes in the other
stream.

In addition, the subrequests do not need to correspond to the boundaries of the
folios or vectors in the source/destination buffer. The library handles the
collection of results and the wrangling of folio flags and references.

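The independent tiling of write streams can be illustrated in userspace. The
following is a minimal sketch rather than netfslib code: ``struct slice`` and
``tile_range()`` are hypothetical names, and each stream is assumed to have
nothing more than a maximum subrequest size.

```c
#include <stddef.h>

/* Hypothetical, cut-down subrequest: just a slice of the file. */
struct slice {
	unsigned long long start;
	size_t len;
};

/*
 * Tile [start, start + len) into slices of at most max_len bytes, writing up
 * to max_out of them to out[].  Returns the number of slices the range
 * needs, which may exceed max_out.
 */
static size_t tile_range(unsigned long long start, size_t len, size_t max_len,
			 struct slice *out, size_t max_out)
{
	size_t n = 0;

	if (max_len == 0)
		return 0;
	while (len > 0) {
		size_t part = len < max_len ? len : max_len;

		if (n < max_out) {
			out[n].start = start;
			out[n].len = part;
		}
		start += part;
		len -= part;
		n++;
	}
	return n;
}
```

Calling ``tile_range()`` twice over the same range with different limits (say,
one limit for the server and one for the cache) yields two sets of slices whose
boundaries need not line up, which is the property described above.
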
Subrequests
-----------

Subrequests are at the heart of the interaction between netfslib and the
filesystem using it. Each subrequest is expected to correspond to a single
read or write RPC or cache operation. The library will stitch together the
results from a set of subrequests to provide a higher level operation.

Netfslib has two interactions with the filesystem or the cache when setting up
a subrequest. First, there's an optional preparatory step that allows the
filesystem to negotiate the limits on the subrequest, both in terms of maximum
number of bytes and maximum number of vectors (e.g. for RDMA). This may
involve negotiating with the server (e.g. cifs needing to acquire credits).

And, secondly, there's the issuing step in which the subrequest is handed off
to the filesystem to perform.

Note that these two steps are done slightly differently between read and write:

 * For reads, the VM/VFS tells us how much is being requested up front, so the
   library can preset maximum values that the cache and then the filesystem
   can reduce. The cache also gets consulted first on whether it wants to do
   a read before the filesystem is consulted.

 * For writeback, it is unknown how much there will be to write until the
   pagecache is walked, so no limit is set by the library.

Once a subrequest is completed, the filesystem or cache informs the library of
the completion and then collection is invoked. Depending on whether the
request is synchronous or asynchronous, the collection of results will be done
in either the application thread or in a work queue.

Result Collection and Retry
---------------------------

As subrequests complete, the results are collected and collated by the library
and folio unlocking is performed progressively (if appropriate). Once the
request is complete, async completion will be invoked (again, if appropriate).
It is possible for the filesystem to provide interim progress reports to the
library to cause folio unlocking to happen earlier if possible.

If any subrequests fail, netfslib can retry them. It will wait until all
subrequests are completed, offer the filesystem the opportunity to fiddle with
the resources/state held by the request and poke at the subrequests before
re-preparing and re-issuing the subrequests.

This allows the tiling of contiguous sets of failed subrequests within a stream
to be changed, adding more subrequests or ditching excess as necessary (for
instance, if the network sizes change or the server decides it wants smaller
chunks).

Further, if one or more contiguous cache-read subrequests fail, the library
will pass them to the filesystem to perform instead, renegotiating and retiling
them as necessary to fit with the filesystem's parameters rather than those of
the cache.

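As a rough illustration of retiling, the userspace sketch below merges a
contiguous run of failed subrequests and works out how many subrequests would
be needed under a new size limit. It is a model under stated assumptions, not
the library's algorithm: ``struct model_sreq`` and ``model_retile_count()``
are hypothetical, and only a byte limit is considered.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down subrequest for the purposes of this model. */
struct model_sreq {
	unsigned long long start;
	size_t len;
	bool failed;
};

/*
 * Find the contiguous run of failed subrequests beginning at index 'from' and
 * report how many subrequests of at most new_max bytes would be needed to
 * cover it.  *run_len is set to the number of bytes in the run.
 */
static size_t model_retile_count(const struct model_sreq *sreqs, size_t nr,
				 size_t from, size_t new_max, size_t *run_len)
{
	size_t bytes = 0;

	for (size_t i = from; i < nr && sreqs[i].failed; i++) {
		/* Stop at a gap: only contiguous subrequests are merged. */
		if (i > from &&
		    sreqs[i].start != sreqs[i - 1].start + sreqs[i - 1].len)
			break;
		bytes += sreqs[i].len;
	}
	*run_len = bytes;
	if (new_max == 0)
		return 0;
	return (bytes + new_max - 1) / new_max;
}
```

A smaller ``new_max`` produces more subrequests than failed and a larger one
produces fewer, corresponding to "adding more subrequests or ditching excess".
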
Local Caching
-------------

One of the services netfslib provides, via ``fscache``, is the option to cache
on local disk a copy of the data obtained from/written to a network filesystem.
The library will manage the storing, retrieval and some invalidation of data
automatically on behalf of the filesystem if a cookie is attached to the
``netfs_inode``.

Note that local caching used to use the PG_private_2 flag (aliased as
PG_fscache) to keep track of a page that was being written to the cache, but
this is now deprecated as PG_private_2 will be removed.

Instead, folios that are read from the server for which there was no data in
the cache will be marked as dirty and will have ``folio->private`` set to a
special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left to writeback to write.
If the folio is modified before that happens, the special value will be
cleared and the folio will become normally dirty.

When writeback occurs, folios that are so marked will only be written to the
cache and not to the server. Writeback handles mixed cache-only writes and
server-and-cache writes by using two streams, sending one to the cache and one
to the server. The server stream will have gaps in it corresponding to those
folios.

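The routing of dirty folios to the two write streams can be modelled as
follows. This is a userspace sketch only: ``enum folio_mark``, the ``STREAM_*``
masks and both functions are hypothetical stand-ins for state that netfslib
keeps in ``folio->private`` and in the request.

```c
#include <stdbool.h>

#define STREAM_SERVER	(1u << 0)	/* Upload to the server */
#define STREAM_CACHE	(1u << 1)	/* Write to the local cache */

/* Hypothetical stand-in for the state recorded in folio->private. */
enum folio_mark {
	FOLIO_NORMAL_DIRTY,	/* Modified locally */
	FOLIO_COPY_TO_CACHE,	/* Read from the server, only needs caching */
};

/* Which write streams a dirty folio should be sent to. */
static unsigned int streams_for_folio(enum folio_mark mark, bool cache_enabled)
{
	unsigned int streams = 0;

	if (mark != FOLIO_COPY_TO_CACHE)
		streams |= STREAM_SERVER;
	if (cache_enabled)
		streams |= STREAM_CACHE;
	return streams;
}

/* A local modification turns a copy-to-cache folio into a normal dirty one. */
static enum folio_mark folio_modified(enum folio_mark mark)
{
	(void)mark;
	return FOLIO_NORMAL_DIRTY;
}
```

A folio for which ``streams_for_folio()`` omits ``STREAM_SERVER`` is one of the
gaps in the server stream mentioned above.
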
Content Encryption (fscrypt)
----------------------------

Though it does not do so yet, at some point netfslib will acquire the ability
to do client-side content encryption on behalf of the network filesystem (Ceph,
for example). fscrypt can be used for this if appropriate (it may not be -
cifs, for example).

The data will be stored encrypted in the local cache using the same manner of
encryption as the data written to the server and the library will impose bounce
buffering and RMW cycles as necessary.


Per-Inode Context
=================

The network filesystem helper library needs a place to store a bit of state for
its use on each netfs inode it is helping to manage. To this end, a context
structure is defined::

	struct netfs_inode {
		struct inode inode;
		const struct netfs_request_ops *ops;
		struct fscache_cookie *cache;
		loff_t remote_i_size;
		unsigned long flags;
		...
	};

A network filesystem that wants to use netfslib must place one of these in its
inode wrapper struct instead of the VFS ``struct inode``. This can be done in
a way similar to the following::

	struct my_inode {
		struct netfs_inode netfs; /* Netfslib context and vfs inode */
		...
	};

This allows netfslib to find its state by using ``container_of()`` from the
inode pointer, thereby allowing the netfslib helper functions to be pointed to
directly by the VFS/VM operation tables.

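The embedding trick can be tried out in userspace. The following is a minimal
sketch, not kernel code: the structs are cut-down stand-ins, the
``container_of()`` macro is a simplified form without type checking, and
``server_id``, ``to_netfs_inode()`` and ``my_inode_from_vfs()`` are
hypothetical names.

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel types (assumptions for illustration). */
struct inode { unsigned long i_ino; };

struct netfs_inode {
	struct inode inode;		/* Embedded VFS inode */
	unsigned long flags;
};

struct my_inode {
	struct netfs_inode netfs;	/* Netfslib context and VFS inode */
	int server_id;			/* Hypothetical filesystem-private field */
};

/* Simplified form of the kernel's container_of() (no type checking). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Plays the role of netfs_inode(): inode pointer -> netfs context. */
static struct netfs_inode *to_netfs_inode(struct inode *inode)
{
	return container_of(inode, struct netfs_inode, inode);
}

/* Hypothetical helper: inode pointer -> filesystem's wrapper struct. */
static struct my_inode *my_inode_from_vfs(struct inode *inode)
{
	return container_of(to_netfs_inode(inode), struct my_inode, netfs);
}
```

Because the VFS inode sits inside ``netfs_inode``, which in turn sits inside
the wrapper, both the library and the filesystem can get from the same inode
pointer to their own state without any extra lookup.
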
The structure contains the following fields that are of interest to the
filesystem:

 * ``inode``

   The VFS inode structure.

 * ``ops``

   The set of operations provided by the network filesystem to netfslib.

 * ``cache``

   Local caching cookie, or NULL if no caching is enabled. This field does not
   exist if fscache is disabled.

 * ``remote_i_size``

   The size of the file on the server. This differs from ``inode->i_size`` if
   local modifications have been made but not yet written back.

 * ``flags``

   A set of flags, some of which the filesystem might be interested in:

   * ``NETFS_ICTX_MODIFIED_ATTR``

     Set if netfslib modifies mtime/ctime. The filesystem is free to ignore
     this or clear it.

   * ``NETFS_ICTX_UNBUFFERED``

     Do unbuffered I/O upon the file. Like direct I/O but without the
     alignment limitations. RMW will be performed if necessary. The pagecache
     will not be used unless mmap() is also used.

   * ``NETFS_ICTX_WRITETHROUGH``

     Do writethrough caching upon the file. I/O will be set up and dispatched
     as buffered writes are made to the page cache. Writes made through mmap()
     still go through normal writeback.

   * ``NETFS_ICTX_SINGLE_NO_UPLOAD``

     Set if the file has monolithic content that must be read entirely in a
     single go and must not be written back to the server, though it can be
     cached (e.g. AFS directories).

Inode Context Helper Functions
------------------------------

To help deal with the per-inode context, a number of helper functions are
provided. Firstly, a function to perform basic initialisation on a context and
set the operations table pointer::

	void netfs_inode_init(struct netfs_inode *ctx,
			      const struct netfs_request_ops *ops);

then a function to cast from the VFS inode structure to the netfs context::

	struct netfs_inode *netfs_inode(struct inode *inode);

and finally, a function to get the cache cookie pointer from the context
attached to an inode (or NULL if fscache is disabled)::

	struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);

Inode Locking
-------------

A number of functions are provided to manage the locking of i_rwsem for I/O and
to effectively extend it to provide more separate classes of exclusion::

	int netfs_start_io_read(struct inode *inode);
	void netfs_end_io_read(struct inode *inode);
	int netfs_start_io_write(struct inode *inode);
	void netfs_end_io_write(struct inode *inode);
	int netfs_start_io_direct(struct inode *inode);
	void netfs_end_io_direct(struct inode *inode);

The exclusion breaks down into four separate classes:

 1) Buffered reads and writes.

    Buffered reads can run concurrently with each other and with buffered
    writes, but buffered writes cannot run concurrently with each other.

 2) Direct reads and writes.

    Direct (and unbuffered) reads and writes can run concurrently since they do
    not share local buffering (i.e. the pagecache) and, in a network
    filesystem, are expected to have exclusion managed on the server (though
    this may not be the case for, say, Ceph).

 3) Other major inode modifying operations (e.g. truncate, fallocate).

    These should just access i_rwsem directly.

 4) mmap().

    mmap'd accesses might operate concurrently with any of the other classes.
    They might form the buffer for an intra-file loopback DIO read/write. They
    might be permitted on unbuffered files.

Inode Writeback
---------------

Netfslib will pin resources on an inode for future writeback (such as pinning
use of an fscache cookie) when an inode is dirtied. However, this pinning
needs careful management. To manage the pinning, the following sequence
occurs:

 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the
    pinning begins (when a folio is dirtied, for example) if the cache is
    active to stop the cache structures from being discarded and the cache
    space from being culled. This also prevents re-getting of cache resources
    if the flag is already set.

 2) This flag is then cleared inside the inode lock during inode writeback in
    the VM - and the fact that it was set is transferred to
    ``->unpinned_netfs_wb`` in ``struct writeback_control``.

 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced.

 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup.

 5) The filesystem invokes netfs to do its cleanup.

To do the cleanup, netfslib provides a function to do the resource unpinning::

	int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);

If the filesystem doesn't need to do anything else, this may be set as its
``.write_inode`` method.

Further, if an inode is deleted, the filesystem's write_inode method may not
get called, so::

	void netfs_clear_inode_writeback(struct inode *inode, const void *aux);

must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called.


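The pinning sequence above can be followed through with a small userspace
model. This is a sketch for illustration only: every type and function here is
a hypothetical stand-in (``model_unpin()`` merely plays the role that
``netfs_unpin_writeback()`` has in the real sequence), and the cache pin is
reduced to a simple counter.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; only the fields needed for the sequence. */
struct model_inode {
	bool pinning_netfs_wb;	/* Models I_PINNING_NETFS_WB */
	int cache_pins;		/* Models resources pinned in the cache */
};

struct model_wbc {
	bool unpinned_netfs_wb;	/* Models wbc->unpinned_netfs_wb */
};

/* Step 1: dirtying pins the cache once; later dirtying does not re-pin. */
static void model_dirty(struct model_inode *inode, bool cache_active)
{
	if (cache_active && !inode->pinning_netfs_wb) {
		inode->pinning_netfs_wb = true;
		inode->cache_pins++;
	}
}

/*
 * Steps 2-3: writeback moves the flag to the wbc.  Returns true if
 * write_inode must be forced.
 */
static bool model_writeback(struct model_inode *inode, struct model_wbc *wbc)
{
	wbc->unpinned_netfs_wb = inode->pinning_netfs_wb;
	inode->pinning_netfs_wb = false;
	return wbc->unpinned_netfs_wb;
}

/* Steps 4-5: the cleanup done on behalf of ->write_inode(). */
static void model_unpin(struct model_inode *inode, const struct model_wbc *wbc)
{
	if (wbc->unpinned_netfs_wb)
		inode->cache_pins--;
}
```

The point of the flag hand-off is that the pin taken at step 1 is dropped
exactly once, however many times the inode was dirtied in between.
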
High-Level VFS API
==================

Netfslib provides a number of sets of API calls for the filesystem to delegate
VFS operations to. Netfslib, in turn, will call out to the filesystem and the
cache to negotiate I/O sizes and issue RPCs, and gives the filesystem places to
intervene at various times.

Unlocked Read/Write Iter
------------------------

The first API set is for the delegation of operations to netfslib when the
filesystem is called through the standard VFS read/write_iter methods::

	ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
	ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);

They can be assigned directly to ``.read_iter`` and ``.write_iter``. They
perform the inode locking themselves and the first two will switch between
buffered I/O and DIO as appropriate.

Pre-Locked Read/Write Iter
--------------------------

The second API set is for the delegation of operations to netfslib when the
filesystem is called through the standard VFS methods, but needs to do other
work before or after calling netfslib whilst still inside the locked section
(e.g. Ceph negotiating caps). The unbuffered read function is::

	ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);

This must not be assigned directly to ``.read_iter`` and the filesystem is
responsible for performing the inode locking before calling it. In the case of
buffered read, the filesystem should use ``filemap_read()``.

There are three functions for writes::

	ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
						 struct netfs_group *netfs_group);
	ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
				    struct netfs_group *netfs_group);
	ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
						   struct netfs_group *netfs_group);

These must not be assigned directly to ``.write_iter`` and the filesystem is
responsible for performing the inode locking before calling them.

The first two functions are for buffered writes; the first just adds some
standard write checks and jumps to the second, but if the filesystem wants to
do the checks itself, it can use the second directly. The third function is
for unbuffered or DIO writes.

On all three write functions, there is a writeback group pointer (which should
be NULL if the filesystem doesn't use this). Writeback groups are set on
folios when they're modified. If a folio to-be-modified is already marked with
a different group, it is flushed first. The writeback API allows writing back
of a specific group.

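The flush-before-regroup rule for writeback groups can be modelled in a few
lines. This is a userspace sketch under simplifying assumptions, not the
library's implementation: ``struct model_group``, ``struct model_folio`` and
``model_modify()`` are hypothetical, and a flush is represented by a counter.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_group { int id; };		/* Stand-in for struct netfs_group */

struct model_folio {
	const struct model_group *group;	/* Group set when last modified */
	bool dirty;
	int flushes;				/* Times this folio was flushed */
};

/* Modify a folio on behalf of 'group' (NULL if groups are not used). */
static void model_modify(struct model_folio *folio,
			 const struct model_group *group)
{
	if (folio->dirty && folio->group != group) {
		/* Belongs to a different group: flush before modifying. */
		folio->flushes++;
		folio->dirty = false;
	}
	folio->group = group;
	folio->dirty = true;
}
```

Repeated modifications under the same group accumulate in the folio, whereas
a modification under a new group first forces out the data that belongs to the
old one.
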
Memory-Mapped I/O API
---------------------

An API for support of mmap()'d I/O is provided::

	vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);

This allows the filesystem to delegate ``.page_mkwrite`` to netfslib. The
filesystem should not take the inode lock before calling it, but, as with the
locked write functions above, this does take a writeback group pointer. If the
page to be made writable is in a different group, it will be flushed first.

Monolithic Files API
--------------------

There is also a special API set for files for which the content must be read in
a single RPC (and not written back) and is maintained as a monolithic blob
(e.g. an AFS directory), though it can be stored and updated in the local cache::

	ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
	void netfs_single_mark_inode_dirty(struct inode *inode);
	int netfs_writeback_single(struct address_space *mapping,
				   struct writeback_control *wbc,
				   struct iov_iter *iter);

The first function reads from a file into the given buffer, reading from the
cache in preference if the data is cached there; the second function allows the
inode to be marked dirty, causing a later writeback; and the third function can
be called from the writeback code to write the data to the cache, if there is
one.

The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be
used. The writeback function requires the buffer to be of ITER_FOLIOQ type.

High-Level VM API
=================

Netfslib also provides a number of sets of API calls for the filesystem to
delegate VM operations to. Again, netfslib, in turn, will call out to the
filesystem and the cache to negotiate I/O sizes and issue RPCs, and gives the
filesystem places to intervene at various times::

	void netfs_readahead(struct readahead_control *);
	int netfs_read_folio(struct file *, struct folio *);
	int netfs_writepages(struct address_space *mapping,
			     struct writeback_control *wbc);
	bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
	void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
	bool netfs_release_folio(struct folio *folio, gfp_t gfp);

These are ``address_space_operations`` methods and can be set directly in the
operations table.

Deprecated PG_private_2 API
---------------------------

There is also a deprecated function for filesystems that still use the
``->write_begin`` method::

	int netfs_write_begin(struct netfs_inode *inode, struct file *file,
			      struct address_space *mapping, loff_t pos, unsigned int len,
			      struct folio **_folio, void **_fsdata);

It uses the deprecated PG_private_2 flag and so should not be used.


I/O Request API
===============

The I/O request API comprises a number of structures and a number of functions
that the filesystem may need to use.

Request Structure
-----------------

The request structure manages the request as a whole, holding some resources
and state on behalf of the filesystem and tracking the collection of results::

	struct netfs_io_request {
		enum netfs_io_origin	origin;
		struct inode		*inode;
		struct address_space	*mapping;
		struct netfs_group	*group;
		struct netfs_io_stream	io_streams[];
		void			*netfs_priv;
		void			*netfs_priv2;
		unsigned long long	start;
		unsigned long long	len;
		unsigned long long	i_size;
		unsigned int		debug_id;
		unsigned long		flags;
		...
	};

Many of the fields are for internal use, but the fields shown here are of
interest to the filesystem:

 * ``origin``

   The origin of the request (readahead, read_folio, DIO read, writeback, ...).

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from. The mapping
   may or may not point to inode->i_data.

 * ``group``

   The writeback group this request is dealing with or NULL. This holds a ref
   on the group.

 * ``io_streams``

   The parallel streams of subrequests available to the request. Currently two
   are available, but this may be made extensible in future. ``NR_IO_STREAMS``
   indicates the size of the array.

 * ``netfs_priv``
 * ``netfs_priv2``

   The network filesystem's private data. The value for this can be passed in
   to the helper functions or set during the request.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length. These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.

 * ``flags``

   Flags for managing and controlling the operation of the request. Some of
   these may be of interest to the filesystem:

   * ``NETFS_RREQ_RETRYING``

     Netfslib sets this when generating retries.

   * ``NETFS_RREQ_PAUSE``

     The filesystem can set this to request to pause the library's subrequest
     issuing loop - but care needs to be taken as netfslib may also set it.

   * ``NETFS_RREQ_NONBLOCK``
   * ``NETFS_RREQ_BLOCKED``

     Netfslib sets the first to indicate that non-blocking mode was set by the
     caller and the filesystem can set the second to indicate that it would
     have had to block.

   * ``NETFS_RREQ_USE_PGPRIV2``

     The filesystem can set this if it wants to use PG_private_2 to track
     whether a folio is being written to the cache. This is deprecated as
     PG_private_2 is going to go away.

If the filesystem wants more private data than is afforded by this structure,
then it should wrap it and provide its own allocator.

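The kind of adjustment ``->expand_readahead()`` may make to ``start`` and
``len`` can be sketched in userspace. The following is an illustration under
stated assumptions, not a real implementation of the op: ``struct model_rreq``
and ``model_expand()`` are hypothetical, and the filesystem is assumed to want
requests rounded out to a fixed, non-zero ``granule`` size. The original
region always stays inside the expanded one, so any reduction in ``start`` is
matched by at least as large an increase in ``len``.

```c
/* Cut-down stand-in for the fields ->expand_readahead() may change. */
struct model_rreq {
	unsigned long long start;
	unsigned long long len;
};

/*
 * Expand the request outwards to hypothetical 'granule'-sized boundaries
 * (e.g. a server-side object size).  'granule' must be non-zero.  The
 * original region is always retained.
 */
static void model_expand(struct model_rreq *rreq, unsigned long long granule)
{
	unsigned long long end = rreq->start + rreq->len;
	unsigned long long new_start = rreq->start - rreq->start % granule;
	unsigned long long new_end = end;

	/* Round the end up only if it is not already on a boundary. */
	if (end % granule)
		new_end = end + granule - end % granule;

	rreq->start = new_start;
	rreq->len = new_end - new_start;
}
```

For instance, a request covering bytes 5000 to 7999 expanded with a 4096-byte
granule becomes one covering bytes 4096 to 8191.
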
Stream Structure
----------------

A request is comprised of one or more parallel streams and each stream may be
aimed at a different target.

For read requests, only stream 0 is used. This can contain a mixture of
subrequests aimed at different sources. For write requests, stream 0 is used
for the server and stream 1 is used for the cache. For buffered writeback,
stream 0 is not enabled unless a normal dirty folio is encountered, at which
point ->begin_writeback() will be invoked and the filesystem can mark the
stream available.

The stream struct looks like::

	struct netfs_io_stream {
		unsigned char		stream_nr;
		bool			avail;
		size_t			sreq_max_len;
		unsigned int		sreq_max_segs;
		unsigned int		submit_extendable_to;
		...
	};

A number of members are available for access/use by the filesystem:

 * ``stream_nr``

   The number of the stream within the request.

 * ``avail``

   True if the stream is available for use. The filesystem should set this on
   stream zero in ->begin_writeback().

 * ``sreq_max_len``
 * ``sreq_max_segs``

   These are set by the filesystem or the cache in ->prepare_read() or
   ->prepare_write() for each subrequest to indicate the maximum number of
   bytes and, optionally, the maximum number of segments (if not 0) that the
   subrequest can support.

 * ``submit_extendable_to``

   The size that a subrequest can be rounded up to beyond the EOF, given the
   available buffer. This allows the cache to work out if it can do a DIO read
   or write that straddles the EOF marker.

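How the two per-stream limits might bound a subrequest can be shown with a
userspace sketch. This is an assumption-laden model rather than the library's
code: ``model_limit_subreq()`` is a hypothetical function and the buffer is
represented simply as an array of segment lengths.

```c
#include <stddef.h>

/*
 * Work out how many bytes one subrequest may cover, given the lengths of the
 * buffer segments it would span.  max_segs == 0 means no segment limit.
 */
static size_t model_limit_subreq(const size_t *seg_lens, size_t nr_segs,
				 size_t max_len, unsigned int max_segs)
{
	size_t total = 0;

	for (size_t i = 0; i < nr_segs; i++) {
		if (max_segs && i >= max_segs)
			break;		/* Segment limit reached */
		if (seg_lens[i] >= max_len - total)
			return max_len;	/* Byte limit reached in this segment */
		total += seg_lens[i];
	}
	return total;
}
```

Whichever limit is hit first ends the subrequest, and the remainder of the
buffer would be left for the next subrequest in the stream.
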
Subrequest Structure
--------------------

Individual units of I/O are managed by the subrequest structure. These
represent slices of the overall request and run independently::

	struct netfs_io_subrequest {
		struct netfs_io_request *rreq;
		struct iov_iter		io_iter;
		unsigned long long	start;
		size_t			len;
		size_t			transferred;
		unsigned long		flags;
		short			error;
		unsigned short		debug_index;
		unsigned char		stream_nr;
		...
	};

Each subrequest is expected to access a single source, though the library will
handle falling back from one source type to another. The members are:

 * ``rreq``

   A pointer to the request that the subrequest belongs to.

 * ``io_iter``

   An I/O iterator representing a slice of the buffer to be read into or
   written from.

 * ``start``
 * ``len``

   The file position of the start of this slice of the request and the
   length.

 * ``transferred``

   The amount of data transferred so far for this subrequest. The length of
   the transfer made by this issuance of the subrequest should be added to
   it. If this is less than ``len`` then the subrequest may be reissued to
   continue.

 * ``flags``

   Flags for managing the subrequest. A number of these are of interest to
   the filesystem or cache:

   * ``NETFS_SREQ_MADE_PROGRESS``

     Set by the filesystem to indicate that at least one byte of data was read
     or written.

   * ``NETFS_SREQ_HIT_EOF``

     The filesystem should set this if a read hit the EOF on the file (in which
     case ``transferred`` should stop at the EOF). Netfslib may expand the
     subrequest out to the size of the folio containing the EOF on the off
     chance that a third party change happened or a DIO read may have asked for
     more than is available. The library will clear any excess pagecache.

   * ``NETFS_SREQ_CLEAR_TAIL``

     The filesystem can set this to indicate that the remainder of the slice,
     from ``transferred`` to ``len``, should be cleared. Do not set if HIT_EOF
     is set.

   * ``NETFS_SREQ_NEED_RETRY``

     The filesystem can set this to tell netfslib to retry the subrequest.

   * ``NETFS_SREQ_BOUNDARY``

     This can be set by the filesystem on a subrequest to indicate that it ends
     at a boundary with the filesystem structure (e.g. at the end of a Ceph
     object). It tells netfslib not to retile subrequests across it.

 * ``error``

   This is for the filesystem to store the result of the subrequest. It should
   be set to 0 if successful and a negative error code otherwise.

 * ``debug_index``
 * ``stream_nr``

   A number allocated to this slice that can be displayed in trace lines for
   reference and the number of the request stream that it belongs to.

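The bookkeeping expected around ``transferred`` can be sketched in userspace
as follows. This is a simplified model, not the library's logic:
``struct model_subreq`` and both functions are hypothetical, and the rule that
a subrequest which hit the EOF is not reissued is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-in for the progress fields of a subrequest. */
struct model_subreq {
	size_t len;		/* Size of the slice */
	size_t transferred;	/* Bytes moved so far, over all issuances */
	bool made_progress;	/* Models NETFS_SREQ_MADE_PROGRESS */
	bool hit_eof;		/* Models NETFS_SREQ_HIT_EOF */
};

/* Record the result of one issuance that moved 'got' bytes. */
static void model_account(struct model_subreq *subreq, size_t got, bool eof)
{
	subreq->transferred += got;	/* Add to, rather than overwrite */
	subreq->made_progress = got > 0;
	subreq->hit_eof = eof;
}

/* Assumption: a short subrequest may be reissued unless it hit the EOF. */
static bool model_may_reissue(const struct model_subreq *subreq)
{
	return subreq->transferred < subreq->len && !subreq->hit_eof;
}
```

The important detail is that each issuance adds to ``transferred`` instead of
replacing it, so that a reissued subrequest carries on from where the previous
attempt stopped.
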
If necessary, the filesystem can get and put extra refs on the subrequest it is
given::

	void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
				  enum netfs_sreq_ref_trace what);
	void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
				  enum netfs_sreq_ref_trace what);

using netfs trace codes to indicate the reason. Care must be taken, however,
as once control of the subrequest is returned to netfslib, the same subrequest
can be reissued/retried.

Filesystem Methods
------------------

The filesystem sets a table of operations in ``netfs_inode`` for netfslib to
use::

	struct netfs_request_ops {
		mempool_t *request_pool;
		mempool_t *subrequest_pool;
		int (*init_request)(struct netfs_io_request *rreq, struct file *file);
		void (*free_request)(struct netfs_io_request *rreq);
		void (*free_subrequest)(struct netfs_io_subrequest *rreq);
		void (*expand_readahead)(struct netfs_io_request *rreq);
		int (*prepare_read)(struct netfs_io_subrequest *subreq);
		void (*issue_read)(struct netfs_io_subrequest *subreq);
		void (*done)(struct netfs_io_request *rreq);
		void (*update_i_size)(struct inode *inode, loff_t i_size);
		void (*post_modify)(struct inode *inode);
		void (*begin_writeback)(struct netfs_io_request *wreq);
		void (*prepare_write)(struct netfs_io_subrequest *subreq);
		void (*issue_write)(struct netfs_io_subrequest *subreq);
		void (*retry_request)(struct netfs_io_request *wreq,
				      struct netfs_io_stream *stream);
		void (*invalidate_cache)(struct netfs_io_request *wreq);
	};

The table starts with a pair of optional pointers to memory pools from which
requests and subrequests can be allocated. If these are not given, netfslib
has default pools that it will use instead. If the filesystem wraps the netfs
structs in its own larger structs, then it will need to use its own pools.
Netfslib will allocate directly from the pools.

The methods defined in the table are:
|
|
|
|
* ``init_request()``
|
|
* ``free_request()``
|
|
* ``free_subrequest()``
|
|
|
|
[Optional] A filesystem may implement these to initialise or clean up any
|
|
resources that it attaches to the request or subrequest.
|
|
|
|
* ``expand_readahead()``
|
|
|
|
[Optional] This is called to allow the filesystem to expand the size of a
|
|
readahead request. The filesystem gets to expand the request in both
|
|
directions, though it must retain the initial region as that may represent
|
|
an allocation already made. If local caching is enabled, it gets to expand
|
|
the request first.
|
|
|
|
Expansion is communicated by changing ->start and ->len in the request
|
|
structure. Note that if any change is made, ->len must be increased by at
|
|
least as much as ->start is reduced.
|
|
|
|
 * ``prepare_read()``

   [Optional] This is called to allow the filesystem to limit the size of a
   subrequest. It may also limit the number of individual regions in the
   iterator, such as required by RDMA. This information should be set on
   stream zero in::

	rreq->io_streams[0].sreq_max_len
	rreq->io_streams[0].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple reads in flight.

   Zero should be returned on success and an error code otherwise.

 * ``issue_read()``

   [Required] Netfslib calls this to dispatch a subrequest to the server for
   reading. In the subrequest, ->start, ->len and ->transferred indicate what
   data should be read from the server and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_read_subreq_terminated()`` function
   should be called to indicate that the subrequest completed either way.
   ->error, ->transferred and ->flags should be updated before completing. The
   termination can be done asynchronously.

   Note: the filesystem must not deal with setting folios uptodate, unlocking
   them or dropping their refs - the library deals with this as it may have to
   stitch together the results of multiple subrequests that variously overlap
   the set of folios.

 * ``done()``

   [Optional] This is called after the folios in a read request have all been
   unlocked (and marked uptodate if applicable).

 * ``update_i_size()``

   [Optional] This is invoked by netfslib at various points during the write
   paths to ask the filesystem to update its idea of the file size. If not
   given, netfslib will set i_size and i_blocks and update the local cache
   cookie.

 * ``post_modify()``

   [Optional] This is called after netfslib writes to the pagecache or when it
   allows an mmap'd page to be marked as writable.

 * ``begin_writeback()``

   [Optional] Netfslib calls this when processing a writeback request if it
   finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE,
   indicating it must be written to the server. This allows the filesystem to
   only set up writeback resources when it knows it's going to have to perform
   a write.

 * ``prepare_write()``

   [Optional] This is called to allow the filesystem to limit the size of a
   subrequest. It may also limit the number of individual regions in the
   iterator, such as required by RDMA. This information should be set on the
   stream to which the subrequest belongs::

	rreq->io_streams[subreq->stream_nr].sreq_max_len
	rreq->io_streams[subreq->stream_nr].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple writes in flight.

   This is not permitted to return an error. Instead, in the event of failure,
   ``netfs_prepare_write_failed()`` must be called.

 * ``issue_write()``

   [Required] This is used to dispatch a subrequest to the server for writing.
   In the subrequest, ->start, ->len and ->transferred indicate what data
   should be written to the server and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_write_subrequest_terminated()``
   function should be called to indicate that the subrequest completed either
   way. ->error, ->transferred and ->flags should be updated before completing.
   The termination can be done asynchronously.

   Note: the filesystem must not deal with removing the dirty or writeback
   marks on folios involved in the operation and should not take refs or pins
   on them, but should leave retention to netfslib.

 * ``retry_request()``

   [Optional] Netfslib calls this at the beginning of a retry cycle. This
   allows the filesystem to examine the state of the request, the subrequests
   in the indicated stream and of its own data and make adjustments or
   renegotiate resources.

 * ``invalidate_cache()``

   [Optional] This is called by netfslib to invalidate data stored in the local
   cache in the event that writing to the local cache fails, providing updated
   coherency data that netfs can't provide.

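To make the ``expand_readahead()`` and ``prepare_read()`` rules described in
the list above more concrete, here is a minimal userspace sketch. The ``ex_``
structs only mimic the fields named in this document, the granule and limits
are invented values, and, unlike the real ``prepare_read()`` method (which is
passed the subrequest), the sketch is handed the request directly::

	#include <assert.h>
	#include <errno.h>
	#include <stddef.h>

	/* Stand-ins mimicking only the fields named in the text. */
	struct ex_stream {
		size_t sreq_max_len;		/* Max bytes per subrequest */
		unsigned int sreq_max_segs;	/* Max iterator segments */
	};

	struct ex_request {
		unsigned long long start;	/* File offset of the request */
		size_t len;			/* Length of the request */
		struct ex_stream io_streams[2];
	};

	/*
	 * Expand a request outwards to granule boundaries.  The original
	 * region is retained and ->len grows by at least as much as ->start
	 * shrinks.
	 */
	static void ex_expand_readahead(struct ex_request *rreq, size_t granule)
	{
		unsigned long long end = rreq->start + rreq->len;
		unsigned long long new_start, new_end;

		assert(granule > 0);
		new_start = rreq->start - rreq->start % granule;
		new_end = (end + granule - 1) / granule * granule;

		rreq->start = new_start;
		rreq->len = new_end - new_start;
	}

	/* Limit subrequests on stream zero; 0 on success or a negative errno. */
	static int ex_prepare_read(struct ex_request *rreq, size_t rsize,
				   unsigned int max_segs)
	{
		if (rsize == 0)
			return -EINVAL;
		rreq->io_streams[0].sreq_max_len = rsize;
		rreq->io_streams[0].sreq_max_segs = max_segs;
		return 0;
	}

For instance, a request covering bytes 5000-7999 expanded to a 4096-byte
granule would become start 4096, length 4096: the start moves back by 904
bytes and the length grows by 1096, which satisfies the rule on ->len.
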
Terminating a subrequest
------------------------

When a subrequest completes, there are a number of functions that the cache or
filesystem can call to inform netfslib of the status change. One function is
provided to terminate a write subrequest at the preparation stage and acts
synchronously:

 * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``

   Indicate that the ->prepare_write() call failed. The ``error`` field should
   have been updated.

Note that ->prepare_read() can return an error as a read can simply be aborted.
Dealing with writeback failure is trickier.

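The difference between the two preparation hooks can be sketched as follows.
This is a userspace illustration only: ``ex_subrequest`` and
``ex_prepare_write_failed()`` are hypothetical stand-ins for the subrequest
struct and for ``netfs_prepare_write_failed()``, and the ``have_credits``
condition is invented for the example::

	#include <assert.h>
	#include <errno.h>
	#include <stdbool.h>

	/* Stand-in for the subrequest; only the fields used here. */
	struct ex_subrequest {
		int error;	/* Set before reporting failure */
		bool failed;	/* Recorded by the stand-in below */
	};

	/* Stand-in for netfs_prepare_write_failed(): note the failure. */
	static void ex_prepare_write_failed(struct ex_subrequest *subreq)
	{
		assert(subreq->error < 0);	/* ->error must already be set */
		subreq->failed = true;
	}

	/* A read can simply be aborted, so the hook returns an error code. */
	static int ex_prepare_read(struct ex_subrequest *subreq, bool have_credits)
	{
		(void)subreq;
		return have_credits ? 0 : -EAGAIN;
	}

	/* The write hook has no return value: set ->error, then report it. */
	static void ex_prepare_write(struct ex_subrequest *subreq, bool have_credits)
	{
		if (!have_credits) {
			subreq->error = -EAGAIN;
			ex_prepare_write_failed(subreq);
		}
	}
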
The other functions are used for subrequests that got as far as being issued:

 * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``

   Tell netfslib that a read subrequest has terminated. The ``error``,
   ``flags`` and ``transferred`` fields should have been updated.

 * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``

   Tell netfslib that a write subrequest has terminated. Either the amount of
   data processed or the negative error code can be passed in. This can be
   used as a kiocb completion function.

 * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``

   This is provided to optionally update netfslib on the incremental progress
   of a read, allowing some folios to be unlocked early; it does not actually
   terminate the subrequest. The ``transferred`` field should have been
   updated.

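The ``transferred_or_error`` argument taken by the write termination function
packs two outcomes into a single ``ssize_t``. A minimal userspace sketch of how
such a value might be decoded follows. ``ex_subrequest`` and
``ex_write_terminated()`` are hypothetical and only mimic the ``error`` and
``transferred`` fields mentioned above; accumulating into ``transferred`` is an
assumption made for the example::

	#include <assert.h>
	#include <errno.h>
	#include <stddef.h>
	#include <sys/types.h>

	/* Stand-in for the subrequest; only the fields used here. */
	struct ex_subrequest {
		size_t len;		/* Bytes the subrequest should write */
		size_t transferred;	/* Bytes written so far */
		int error;		/* 0 or a negative errno */
	};

	/* Decode a byte count (>= 0) or a negative error code. */
	static void ex_write_terminated(struct ex_subrequest *subreq,
					ssize_t transferred_or_error)
	{
		if (transferred_or_error < 0) {
			subreq->error = (int)transferred_or_error;
			return;
		}

		subreq->transferred += (size_t)transferred_or_error;
		assert(subreq->transferred <= subreq->len);
	}
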
Local Cache API
---------------

Netfslib provides a separate API for a local cache to implement, though it
provides routines somewhat similar to those of the filesystem request API.

Firstly, the netfs_io_request object contains a place for the cache to hang its
state::

	struct netfs_cache_resources {
		const struct netfs_cache_ops *ops;
		void *cache_priv;
		void *cache_priv2;
		unsigned int debug_id;
		unsigned int inval_counter;
	};

This contains an operations table pointer and two private pointers plus the
debug ID of the fscache cookie for tracing purposes and an invalidation counter
that is cranked by calls to ``fscache_invalidate()`` allowing cache subrequests
to be invalidated after completion.

The cache operation table looks like the following::

	struct netfs_cache_ops {
		void (*end_operation)(struct netfs_cache_resources *cres);
		void (*expand_readahead)(struct netfs_cache_resources *cres,
					 loff_t *_start, size_t *_len, loff_t i_size);
		enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
						     loff_t i_size);
		int (*read)(struct netfs_cache_resources *cres,
			    loff_t start_pos,
			    struct iov_iter *iter,
			    bool seek_data,
			    netfs_io_terminated_t term_func,
			    void *term_func_priv);
		void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
		void (*issue_write)(struct netfs_io_subrequest *subreq);
	};

With a termination handler function pointer::

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);

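A handler matching this typedef might look like the following userspace
sketch. The typedef is copied from above; ``ex_read_state``,
``ex_read_done()`` and ``ex_cache_read()`` are hypothetical, the fake cache
"read" just copies from a memory buffer and completes synchronously, and
passing ``false`` for ``was_async`` in that case is an assumption::

	#include <assert.h>
	#include <errno.h>
	#include <stdbool.h>
	#include <string.h>
	#include <sys/types.h>

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);

	/* Private state handed back to the handler (hypothetical). */
	struct ex_read_state {
		ssize_t result;		/* Bytes read or negative errno */
		bool was_async;
		bool done;
	};

	/* Termination handler: record how the read ended. */
	static void ex_read_done(void *priv, ssize_t transferred_or_error,
				 bool was_async)
	{
		struct ex_read_state *state = priv;

		state->result = transferred_or_error;
		state->was_async = was_async;
		state->done = true;
	}

	/* Fake cache read from a memory "cache", completing synchronously. */
	static void ex_cache_read(const char *cache, size_t cache_size,
				  size_t start, char *buf, size_t len,
				  netfs_io_terminated_t term_func,
				  void *term_func_priv)
	{
		assert(term_func != NULL);
		if (start >= cache_size) {
			term_func(term_func_priv, -ENODATA, false);
			return;
		}
		if (len > cache_size - start)
			len = cache_size - start;
		memcpy(buf, cache + start, len);
		term_func(term_func_priv, (ssize_t)len, false);
	}
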
The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a readahead operation to allow the
   cache to expand a request in either direction. This allows the cache to
   size the request appropriately for the cache granularity.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request. ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.

   The function is passed the subrequest, plus the size of the file for
   reference, and adjusts the slice appropriately. It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should be
   downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache. The start file offset is given
   along with an iterator to read to, which gives the length also. It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function. The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``prepare_write_subreq()``

   [Required] This is called to allow the cache to limit the size of a
   subrequest. It may also limit the number of individual regions in the
   iterator, such as required by DIO/DMA. This information should be set on
   the stream to which the subrequest belongs::

	rreq->io_streams[subreq->stream_nr].sreq_max_len
	rreq->io_streams[subreq->stream_nr].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple writes in flight.

   This is not permitted to return an error. In the event of failure,
   ``netfs_prepare_write_failed()`` must be called.

 * ``issue_write()``

   [Required] This is used to dispatch a subrequest to the cache for writing.
   In the subrequest, ->start, ->len and ->transferred indicate what data
   should be written to the cache and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_write_subrequest_terminated()``
   function should be called to indicate that the subrequest completed either
   way. ->error, ->transferred and ->flags should be updated before completing.
   The termination can be done asynchronously.

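As an illustration of the choice that the cache's ``prepare_read()`` method in
the list above has to make, the userspace sketch below picks a source for a
slice. The enum mirrors the four documented return values under ``EX_`` names;
the policy (zero-fill at or beyond EOF, otherwise use the cache if it holds the
data) and the ``cached_upto`` parameter are assumptions for illustration and do
not describe the behaviour of any particular cache::

	#include <assert.h>
	#include <stddef.h>

	enum ex_io_source {
		EX_FILL_WITH_ZEROES,
		EX_DOWNLOAD_FROM_SERVER,
		EX_READ_FROM_CACHE,
		EX_INVALID_READ,
	};

	/* Stand-in for the subrequest; only the fields used here. */
	struct ex_subrequest {
		unsigned long long start;	/* Where the slice begins */
		size_t len;			/* How big the slice may be */
	};

	/*
	 * Pick a source for the next slice.  cached_upto is the (assumed)
	 * extent of contiguous data held in the cache, counted from offset 0.
	 */
	static enum ex_io_source ex_prepare_read(struct ex_subrequest *subreq,
						 unsigned long long i_size,
						 unsigned long long cached_upto)
	{
		if (subreq->len == 0)
			return EX_INVALID_READ;
		if (subreq->start >= i_size)
			return EX_FILL_WITH_ZEROES;
		if (subreq->start >= cached_upto)
			return EX_DOWNLOAD_FROM_SERVER;

		/* Reduce the slice so it does not run past the cached data. */
		if (subreq->len > cached_upto - subreq->start)
			subreq->len = cached_upto - subreq->start;
		assert(subreq->len > 0);
		return EX_READ_FROM_CACHE;
	}
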
API Function Reference
======================

.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c