License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         270
GPL-2.0+ WITH Linux-syscall-note                        169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
LGPL-2.1+ WITH Linux-syscall-note                        15
GPL-1.0+ WITH Linux-syscall-note                         14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
LGPL-2.0+ WITH Linux-syscall-note                         4
LGPL-2.1 WITH Linux-syscall-note                          3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
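The comment-type distinction described above can be sketched in C. This is only an illustration, not the script that Thomas and Greg used: the function name `format_spdx_line` and its return convention are hypothetical, and it assumes the kernel convention of a `//` comment for .c files and a `/* */` comment for .h files.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the SPDX tag line for a file, choosing
 * the comment style from the file extension.
 * Returns 0 on success, -1 for an unhandled file type or a short buffer.
 */
static int format_spdx_line(const char *path, const char *license,
			    char *buf, size_t buflen)
{
	const char *ext = strrchr(path, '.');
	int n;

	assert(buf != NULL && buflen > 0);

	if (ext && strcmp(ext, ".h") == 0)
		/* Assumed convention: headers use a block comment. */
		n = snprintf(buf, buflen, "/* SPDX-License-Identifier: %s */",
			     license);
	else if (ext && strcmp(ext, ".c") == 0)
		/* Assumed convention: C sources use a line comment. */
		n = snprintf(buf, buflen, "// SPDX-License-Identifier: %s",
			     license);
	else
		return -1;

	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Under these assumptions, calling it with "fs/ceph/addr.c" and "GPL-2.0" yields the same first line that this file carries.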
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
// SPDX-License-Identifier: GPL-2.0
#include <linux/ceph/ceph_debug.h>

#include <linux/backing-dev.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/pagemap.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have a fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/slab.h>
#include <linux/pagevec.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/signal.h>
#include <linux/iversion.h>
#include <linux/ktime.h>
#include <linux/netfs.h>
#include <trace/events/netfs.h>

#include "super.h"
#include "mds_client.h"
#include "cache.h"
#include "metric.h"
#include "crypto.h"
#include <linux/ceph/osd_client.h>
#include <linux/ceph/striper.h>

/*
 * Ceph address space ops.
 *
 * There are a few funny things going on here.
 *
 * The page->private field is used to reference a struct
 * ceph_snap_context for _every_ dirty page. This indicates which
 * snapshot the page was logically dirtied in, and thus which snap
 * context needs to be associated with the osd write during writeback.
 *
 * Similarly, struct ceph_inode_info maintains a set of counters to
 * count dirty pages on the inode. In the absence of snapshots,
 * i_wrbuffer_ref == i_wrbuffer_ref_head == the dirty page count.
 *
 * When a snapshot is taken (that is, when the client receives
 * notification that a snapshot was taken), each inode with caps and
 * with dirty pages (dirty pages implies there is a cap) gets a new
 * ceph_cap_snap in the i_cap_snaps list (which is sorted in ascending
 * order, new snaps go to the tail). The i_wrbuffer_ref_head count is
 * moved to capsnap->dirty. (Unless a sync write is currently in
 * progress. In that case, the capsnap is said to be "pending", new
 * writes cannot start, and the capsnap isn't "finalized" until the
 * write completes (or fails) and a final size/mtime for the inode for
 * that snap can be settled upon.) i_wrbuffer_ref_head is reset to 0.
 *
 * On writeback, we must submit writes to the osd IN SNAP ORDER. So,
 * we look for the first capsnap in i_cap_snaps and write out pages in
 * that snap context _only_. Then we move on to the next capsnap,
 * eventually reaching the "live" or "head" context (i.e., pages that
 * are not yet snapped) and are writing the most recently dirtied
 * pages.
 *
 * Invalidate and so forth must take care to ensure the dirty page
 * accounting is preserved.
 */

#define CONGESTION_ON_THRESH(congestion_kb) (congestion_kb >> (PAGE_SHIFT-10))
#define CONGESTION_OFF_THRESH(congestion_kb)				\
	(CONGESTION_ON_THRESH(congestion_kb) -				\
	 (CONGESTION_ON_THRESH(congestion_kb) >> 2))

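As a rough worked example of the two threshold macros above, the following userspace sketch evaluates them for a sample value. It assumes 4 KiB pages (a PAGE_SHIFT of 12), whereas the kernel takes PAGE_SHIFT from the architecture. The helper names `congestion_on_pages` and `congestion_off_pages` are hypothetical and exist only for this illustration, and the macro argument is parenthesized here as a defensive tweak.

```c
#include <assert.h>

/* Assumption for this sketch only: 4 KiB pages. */
#define PAGE_SHIFT 12

/* Convert a limit in KiB to a page count (KiB >> (PAGE_SHIFT - 10)). */
#define CONGESTION_ON_THRESH(congestion_kb) \
	((congestion_kb) >> (PAGE_SHIFT - 10))
/* The "off" threshold sits 25% below the "on" threshold. */
#define CONGESTION_OFF_THRESH(congestion_kb)			\
	(CONGESTION_ON_THRESH(congestion_kb) -			\
	 (CONGESTION_ON_THRESH(congestion_kb) >> 2))

/* Hypothetical wrapper: pages in flight at which congestion is flagged. */
static unsigned long congestion_on_pages(unsigned long kb)
{
	return CONGESTION_ON_THRESH(kb);
}

/* Hypothetical wrapper: pages in flight at which congestion is cleared. */
static unsigned long congestion_off_pages(unsigned long kb)
{
	unsigned long on = CONGESTION_ON_THRESH(kb);
	unsigned long off = CONGESTION_OFF_THRESH(kb);

	/* The gap between the two thresholds provides hysteresis. */
	assert(off <= on);
	return off;
}
```

With a limit of 4096 KiB and these assumptions, the "on" threshold is 1024 pages and the "off" threshold is 768 pages.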
static int ceph_netfs_check_write_begin(struct file *file, loff_t pos, unsigned int len,
					struct folio **foliop, void **_fsdata);

static inline struct ceph_snap_context *page_snap_context(struct page *page)
{
	if (PagePrivate(page))
		return (void *)page->private;
	return NULL;
}

/*
 * Dirty a page. Optimistically adjust accounting, on the assumption
 * that we won't race with invalidate. If we do, readjust.
 */
static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
{
	struct inode *inode = mapping->host;
	struct ceph_client *cl = ceph_inode_to_client(inode);
	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
	struct ceph_inode_info *ci;
	struct ceph_snap_context *snapc;

	if (folio_test_dirty(folio)) {
		doutc(cl, "%llx.%llx %p idx %lu -- already dirty\n",
		      ceph_vinop(inode), folio, folio->index);
		VM_BUG_ON_FOLIO(!folio_test_private(folio), folio);
		return false;
	}

	atomic64_inc(&mdsc->dirty_folios);

	ci = ceph_inode(inode);

	/* dirty the head */
	spin_lock(&ci->i_ceph_lock);
	if (__ceph_have_pending_cap_snap(ci)) {
		struct ceph_cap_snap *capsnap =
				list_last_entry(&ci->i_cap_snaps,
						struct ceph_cap_snap,
						ci_item);
		snapc = ceph_get_snap_context(capsnap->context);
		capsnap->dirty_pages++;
	} else {
		BUG_ON(!ci->i_head_snapc);
		snapc = ceph_get_snap_context(ci->i_head_snapc);
		++ci->i_wrbuffer_ref_head;
	}
	if (ci->i_wrbuffer_ref == 0)
		ihold(inode);
	++ci->i_wrbuffer_ref;
	doutc(cl, "%llx.%llx %p idx %lu head %d/%d -> %d/%d "
	      "snapc %p seq %lld (%d snaps)\n",
	      ceph_vinop(inode), folio, folio->index,
	      ci->i_wrbuffer_ref-1, ci->i_wrbuffer_ref_head-1,
	      ci->i_wrbuffer_ref, ci->i_wrbuffer_ref_head,
	      snapc, snapc->seq, snapc->num_snaps);
	spin_unlock(&ci->i_ceph_lock);

	/*
	 * Reference snap context in folio->private. Also set
	 * PagePrivate so that we get invalidate_folio callback.
	 */
	VM_WARN_ON_FOLIO(folio->private, folio);
	folio_attach_private(folio, snapc);

	return ceph_fscache_dirty_folio(mapping, folio);
}

/*
 * If we are truncating the full folio (i.e. offset == 0), adjust the
 * dirty folio counters appropriately. Only called if there is private
 * data on the folio.
 */
static void ceph_invalidate_folio(struct folio *folio, size_t offset,
				size_t length)
{
	struct inode *inode = folio->mapping->host;
	struct ceph_client *cl = ceph_inode_to_client(inode);
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct ceph_snap_context *snapc;

	if (offset != 0 || length != folio_size(folio)) {
		doutc(cl, "%llx.%llx idx %lu partial dirty page %zu~%zu\n",
		      ceph_vinop(inode), folio->index, offset, length);
		return;
	}

	WARN_ON(!folio_test_locked(folio));
	if (folio_test_private(folio)) {
		doutc(cl, "%llx.%llx idx %lu full dirty page\n",
		      ceph_vinop(inode), folio->index);

		snapc = folio_detach_private(folio);
		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
		ceph_put_snap_context(snapc);
	}

	netfs_invalidate_folio(folio, offset, length);
}

static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
{
	struct inode *inode = rreq->inode;
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct ceph_file_layout *lo = &ci->i_layout;
	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
	loff_t end = rreq->start + rreq->len, new_end;
	struct ceph_netfs_request_data *priv = rreq->netfs_priv;
	unsigned long max_len;
	u32 blockoff;

	if (priv) {
		/* Readahead is disabled by posix_fadvise POSIX_FADV_RANDOM */
		if (priv->file_ra_disabled)
			max_pages = 0;
		else
			max_pages = priv->file_ra_pages;
	}

	/* Readahead is disabled */
	if (!max_pages)
		return;

	max_len = max_pages << PAGE_SHIFT;

	/*
	 * Try to expand the length forward by rounding it up to the next
	 * block, but do not exceed the file size, unless the original
	 * request already exceeds it.
	 */
	new_end = umin(round_up(end, lo->stripe_unit), rreq->i_size);
	if (new_end > end && new_end <= rreq->start + max_len)
		rreq->len = new_end - rreq->start;

	/* Try to expand the start downward */
	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
	if (rreq->len + blockoff <= max_len) {
		rreq->start -= blockoff;
		rreq->len += blockoff;
	}
}
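The window arithmetic in ceph_netfs_expand_readahead() can be tried out in userspace with a sketch like the one below. It is an illustration under stated assumptions, not kernel code: `struct ra_window`, `round_up_u64` and `expand_window` are hypothetical names, the rounding uses plain division rather than the kernel's round_up() macro, and the stripe unit, file size and readahead limit are passed in as plain parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the start/len pair of a read request. */
struct ra_window {
	uint64_t start;
	uint64_t len;
};

/* Round v up to the next multiple of unit (unit must be non-zero). */
static uint64_t round_up_u64(uint64_t v, uint64_t unit)
{
	return ((v + unit - 1) / unit) * unit;
}

/*
 * Grow the window towards stripe-unit boundaries without letting its
 * length exceed max_len, and without growing the end past i_size.
 */
static void expand_window(struct ra_window *w, uint64_t stripe_unit,
			  uint64_t i_size, uint64_t max_len)
{
	uint64_t end = w->start + w->len;
	uint64_t new_end, blockoff;

	assert(stripe_unit > 0);

	/* A zero limit means readahead is disabled: leave the window alone. */
	if (max_len == 0)
		return;

	/* Expand the end forward to the next stripe-unit boundary. */
	new_end = round_up_u64(end, stripe_unit);
	if (new_end > i_size)
		new_end = i_size;
	if (new_end > end && new_end <= w->start + max_len)
		w->len = new_end - w->start;

	/* Expand the start downward to the previous boundary. */
	blockoff = w->start % stripe_unit;
	if (w->len + blockoff <= max_len) {
		w->start -= blockoff;
		w->len += blockoff;
	}
}
```

For example, a 1 MiB request at offset 5 MiB with a 4 MiB stripe unit, an 8 MiB limit and a 100 MiB file becomes a 4 MiB request at offset 4 MiB, so it covers exactly one stripe unit.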

static void finish_netfs_read(struct ceph_osd_request *req)
{
	struct inode *inode = req->r_inode;
	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
	struct ceph_client *cl = fsc->client;
	struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
	struct netfs_io_subrequest *subreq = req->r_priv;
	struct ceph_osd_req_op *op = &req->r_ops[0];
	int err = req->r_result;
	bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ);

	ceph_update_read_metrics(&fsc->mdsc->metric, req->r_start_latency,
				 req->r_end_latency, osd_data->length, err);

	doutc(cl, "result %d subreq->len=%zu i_size=%lld\n", req->r_result,
	      subreq->len, i_size_read(req->r_inode));

	/* no object means success but no data */
	if (err == -ENOENT) {
		__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
		__set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
		err = 0;
	} else if (err == -EBLOCKLISTED) {
		fsc->blocklisted = true;
	}

	if (err >= 0) {
		if (sparse && err > 0)
			err = ceph_sparse_ext_map_end(op);
		if (err < subreq->len &&
		    subreq->rreq->origin != NETFS_UNBUFFERED_READ &&
		    subreq->rreq->origin != NETFS_DIO_READ)
			__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
		if (IS_ENCRYPTED(inode) && err > 0) {
			err = ceph_fscrypt_decrypt_extents(inode,
					osd_data->pages, subreq->start,
					op->extent.sparse_ext,
					op->extent.sparse_ext_cnt);
			if (err > subreq->len)
				err = subreq->len;
		}
		if (err > 0)
			__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
	}

	if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
		ceph_put_page_vector(osd_data->pages,
				     calc_pages_for(osd_data->alignment,
					osd_data->length), false);
	}
	if (err > 0) {
		subreq->transferred = err;
		err = 0;
	}
	subreq->error = err;
	trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
	netfs_read_subreq_terminated(subreq);
	iput(req->r_inode);
	ceph_dec_osd_stopping_blocker(fsc->mdsc);
}

static bool ceph_netfs_issue_op_inline(struct netfs_io_subrequest *subreq)
{
	struct netfs_io_request *rreq = subreq->rreq;
	struct inode *inode = rreq->inode;
	struct ceph_mds_reply_info_parsed *rinfo;
	struct ceph_mds_reply_info_in *iinfo;
	struct ceph_mds_request *req;
	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
	struct ceph_inode_info *ci = ceph_inode(inode);
	ssize_t err = 0;
	size_t len;
	int mode;

	if (rreq->origin != NETFS_UNBUFFERED_READ &&
	    rreq->origin != NETFS_DIO_READ)
		__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
	__clear_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);

	if (subreq->start >= inode->i_size)
		goto out;

	/* We need to fetch the inline data. */
	mode = ceph_try_to_choose_auth_mds(inode, CEPH_STAT_CAP_INLINE_DATA);
	req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode);
	if (IS_ERR(req)) {
		err = PTR_ERR(req);
		goto out;
	}
	req->r_ino1 = ci->i_vino;
	req->r_args.getattr.mask = cpu_to_le32(CEPH_STAT_CAP_INLINE_DATA);
	req->r_num_caps = 2;

	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
	err = ceph_mdsc_do_request(mdsc, NULL, req);
	if (err < 0)
		goto out;

	rinfo = &req->r_reply_info;
	iinfo = &rinfo->targeti;
	if (iinfo->inline_version == CEPH_INLINE_NONE) {
		/* The data got uninlined */
		ceph_mdsc_put_request(req);
		return false;
	}

	len = min_t(size_t, iinfo->inline_len - subreq->start, subreq->len);
	err = copy_to_iter(iinfo->inline_data + subreq->start, len, &subreq->io_iter);
	if (err == 0) {
		err = -EFAULT;
	} else {
		subreq->transferred += err;
		err = 0;
	}

	ceph_mdsc_put_request(req);
out:
	subreq->error = err;
	trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
	netfs_read_subreq_terminated(subreq);
	return true;
}

static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq)
{
	struct netfs_io_request *rreq = subreq->rreq;
	struct inode *inode = rreq->inode;
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
	u64 objno, objoff;
	u32 xlen;

	/* Truncate the extent at the end of the current block */
	ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len,
				      &objno, &objoff, &xlen);
	rreq->io_streams[0].sreq_max_len = umin(xlen, fsc->mount_options->rsize);
	return 0;
}

static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
{
	struct netfs_io_request *rreq = subreq->rreq;
	struct inode *inode = rreq->inode;
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
	struct ceph_client *cl = fsc->client;
	struct ceph_osd_request *req = NULL;
	struct ceph_vino vino = ceph_vino(inode);
	int err;
	u64 len;
	bool sparse = IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREAD);
	u64 off = subreq->start;
	int extent_cnt;

	if (ceph_inode_is_shutdown(inode)) {
		err = -EIO;
		goto out;
	}

	if (ceph_has_inline_data(ci) && ceph_netfs_issue_op_inline(subreq))
		return;

	// TODO: This rounding here is slightly dodgy. It *should* work, for
	// now, as the cache only deals in blocks that are a multiple of
	// PAGE_SIZE and fscrypt blocks are at most PAGE_SIZE. What needs to
	// happen is for the fscrypt driving to be moved into netfslib and the
	// data in the cache also to be stored encrypted.
	len = subreq->len;
	ceph_fscrypt_adjust_off_and_len(inode, &off, &len);

	req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino,
			off, &len, 0, 1, sparse ? CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ,
			CEPH_OSD_FLAG_READ, NULL, ci->i_truncate_seq,
			ci->i_truncate_size, false);
	if (IS_ERR(req)) {
		err = PTR_ERR(req);
		req = NULL;
		goto out;
	}

	if (sparse) {
		extent_cnt = __ceph_sparse_read_ext_count(inode, len);
		err = ceph_alloc_sparse_ext_map(&req->r_ops[0], extent_cnt);
		if (err)
			goto out;
	}

	doutc(cl, "%llx.%llx pos=%llu orig_len=%zu len=%llu\n",
	      ceph_vinop(inode), subreq->start, subreq->len, len);

	/*
	 * FIXME: For now, use CEPH_OSD_DATA_TYPE_PAGES instead of _ITER for
	 * encrypted inodes. We'd need infrastructure that handles an iov_iter
	 * instead of page arrays, and we don't have that as of yet. Once the
	 * dust settles on the write helpers and encrypt/decrypt routines for
	 * netfs, we should be able to rework this.
	 */
	if (IS_ENCRYPTED(inode)) {
		struct page **pages;
		size_t page_off;

ceph: avoid kernel BUG for encrypted inode with unaligned file size
The generic/397 test hits a BUG_ON for the case of encrypted inode with
unaligned file size (for example, 33K or 1K):
[ 877.737811] run fstests generic/397 at 2025-01-03 12:34:40
[ 877.875761] libceph: mon0 (2)127.0.0.1:40674 session established
[ 877.876130] libceph: client4614 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 877.991965] libceph: mon0 (2)127.0.0.1:40674 session established
[ 877.992334] libceph: client4617 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.017234] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.017594] libceph: client4620 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.031394] xfs_io (pid 18988) is setting deprecated v1 encryption policy; recommend upgrading to v2.
[ 878.054528] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.054892] libceph: client4623 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.070287] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.070704] libceph: client4626 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.264586] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.265258] libceph: client4629 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.374578] -----------[ cut here ]------------
[ 878.374586] kernel BUG at net/ceph/messenger.c:1070!
[ 878.375150] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 878.378145] CPU: 2 UID: 0 PID: 4759 Comm: kworker/2:9 Not tainted 6.13.0-rc5+ #1
[ 878.378969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 878.380167] Workqueue: ceph-msgr ceph_con_workfn
[ 878.381639] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50
[ 878.382152] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
[ 878.383928] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287
[ 878.384447] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000
[ 878.385129] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378
[ 878.385839] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000
[ 878.386539] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030
[ 878.387203] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001
[ 878.387877] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000
[ 878.388663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 878.389212] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0
[ 878.389921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 878.390620] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 878.391307] PKRU: 55555554
[ 878.391567] Call Trace:
[ 878.391807] <TASK>
[ 878.392021] ? show_regs+0x71/0x90
[ 878.392391] ? die+0x38/0xa0
[ 878.392667] ? do_trap+0xdb/0x100
[ 878.392981] ? do_error_trap+0x75/0xb0
[ 878.393372] ? ceph_msg_data_cursor_init+0x42/0x50
[ 878.393842] ? exc_invalid_op+0x53/0x80
[ 878.394232] ? ceph_msg_data_cursor_init+0x42/0x50
[ 878.394694] ? asm_exc_invalid_op+0x1b/0x20
[ 878.395099] ? ceph_msg_data_cursor_init+0x42/0x50
[ 878.395583] ? ceph_con_v2_try_read+0xd16/0x2220
[ 878.396027] ? _raw_spin_unlock+0xe/0x40
[ 878.396428] ? raw_spin_rq_unlock+0x10/0x40
[ 878.396842] ? finish_task_switch.isra.0+0x97/0x310
[ 878.397338] ? __schedule+0x44b/0x16b0
[ 878.397738] ceph_con_workfn+0x326/0x750
[ 878.398121] process_one_work+0x188/0x3d0
[ 878.398522] ? __pfx_worker_thread+0x10/0x10
[ 878.398929] worker_thread+0x2b5/0x3c0
[ 878.399310] ? __pfx_worker_thread+0x10/0x10
[ 878.399727] kthread+0xe1/0x120
[ 878.400031] ? __pfx_kthread+0x10/0x10
[ 878.400431] ret_from_fork+0x43/0x70
[ 878.400771] ? __pfx_kthread+0x10/0x10
[ 878.401127] ret_from_fork_asm+0x1a/0x30
[ 878.401543] </TASK>
[ 878.401760] Modules linked in: hctr2 nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 essiv authenc mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit kvm_intel kvm crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel joydev crypto_simd cryptd rapl input_leds psmouse sch_fq_codel serio_raw bochs i2c_piix4 floppy qemu_fw_cfg i2c_smbus mac_hid pata_acpi msr parport_pc ppdev lp parport efi_pstore ip_tables x_tables
[ 878.407319] ---[ end trace 0000000000000000 ]---
[ 878.407775] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50
[ 878.408317] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
[ 878.410087] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287
[ 878.410609] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000
[ 878.411318] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378
[ 878.412014] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000
[ 878.412735] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030
[ 878.413438] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001
[ 878.414121] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000
[ 878.414935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 878.415516] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0
[ 878.416211] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 878.416907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 878.417630] PKRU: 55555554
(gdb) l *ceph_msg_data_cursor_init+0x42
0xffffffff823b45a2 is in ceph_msg_data_cursor_init (net/ceph/messenger.c:1070).
1065
1066 void ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor,
1067 struct ceph_msg *msg, size_t length)
1068 {
1069 BUG_ON(!length);
1070 BUG_ON(length > msg->data_length);
1071 BUG_ON(!msg->num_data_items);
1072
1073 cursor->total_resid = length;
1074 cursor->data = msg->data;
The issue takes place because of this:
[ 202.628853] libceph: net/ceph/messenger_v2.c:2034 prepare_sparse_read_data(): msg->data_length 33792, msg->sparse_read_total 36864
1070 BUG_ON(length > msg->data_length);
The generic/397 test (xfstests) executes the following steps:
(1) create encrypted files and directories;
(2) access the created files and folders with encryption key;
(3) access the created files and folders without encryption key.
The issue takes place in this portion of code:
    if (IS_ENCRYPTED(inode)) {
            struct page **pages;
            size_t page_off;
            err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len,
                                            &page_off);
            if (err < 0) {
                    doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
                          ceph_vinop(inode), err);
                    goto out;
            }
            /* should always give us a page-aligned read */
            WARN_ON_ONCE(page_off);
            len = err;
            err = 0;
            osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false,
                                             false);
The cause of the issue is that subreq->io_iter.count keeps the unaligned
length value:
[ 347.751182] lib/iov_iter.c:1185 __iov_iter_get_pages_alloc(): maxsize 36864, maxpages 4294967295, start 18446659367320516064
[ 347.752808] lib/iov_iter.c:1196 __iov_iter_get_pages_alloc(): maxsize 33792, maxpages 4294967295, start 18446659367320516064
[ 347.754394] lib/iov_iter.c:1015 iter_folioq_get_pages(): maxsize 33792, maxpages 4294967295, extracted 0, _start_offset 18446659367320516064
This patch simply assigns the aligned value to subreq->io_iter.count
before calling iov_iter_get_pages_alloc2().
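The arithmetic behind the numbers in the logs above can be sketched in userspace C. This is only an illustration, not the kernel code: `sketch_iter`, `sketch_get_pages()` and `sketch_cursor_ok()` are hypothetical stand-ins, and a 4096-byte page size is assumed. It shows why a 33792-byte (33K) count clamps the extraction below the 36864-byte aligned length, and why assigning the aligned length to the count first avoids the mismatch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed page size for illustration; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/* Round len up to the next multiple of align (align must be a power of two). */
static size_t sketch_round_up(size_t len, size_t align)
{
	return (len + align - 1) & ~(align - 1);
}

/* Hypothetical stand-in for the iterator: only the byte count matters here. */
struct sketch_iter {
	size_t count;
};

/*
 * Hypothetical stand-in for page extraction: as the iov_iter log lines
 * suggest, the requested maxsize is clamped to the iterator's own count.
 */
static size_t sketch_get_pages(const struct sketch_iter *it, size_t maxsize)
{
	return maxsize < it->count ? maxsize : it->count;
}

/*
 * Mirrors the checks quoted from ceph_msg_data_cursor_init(): the length
 * must be non-zero and must not exceed the message's data length.
 */
static bool sketch_cursor_ok(size_t length, size_t data_length)
{
	return length != 0 && length <= data_length;
}
```

With `count = 33792`, `sketch_get_pages()` returns 33792 even when asked for `sketch_round_up(33792, SKETCH_PAGE_SIZE)`, so a sparse-read total of 36864 fails `sketch_cursor_ok()`. Setting `count` to the aligned length before extraction makes both values 36864.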
[ idryomov: tag the comment with FIXME to make it clear that it's only
a workaround for netfslib not coexisting with fscrypt nicely
(this is also noted in another pre-existing comment) ]
Cc: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-01-16 16:30:08 -08:00
		/*
		 * FIXME: io_iter.count needs to be corrected to aligned
		 * length. Otherwise, iov_iter_get_pages_alloc2() operates
		 * with the initial unaligned length value. As a result,
		 * ceph_msg_data_cursor_init() triggers BUG_ON() in the case
		 * if msg->sparse_read_total > msg->data_length.
		 */
		subreq->io_iter.count = len;
2024-07-02 00:40:22 +01:00
|
|
|
err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_off);
|
2022-08-25 09:31:22 -04:00
|
|
|
if (err < 0) {
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
|
|
|
|
ceph_vinop(inode), err);
|
2022-08-25 09:31:22 -04:00
			goto out;
		}

		/* should always give us a page-aligned read */
		WARN_ON_ONCE(page_off);
		len = err;
		err = 0;

		osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false,
						 false);
	} else {
2024-07-02 00:40:22 +01:00
|
|
|
osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter);
|
2022-08-25 09:31:22 -04:00
|
|
|
}
|
2023-05-09 16:38:49 +08:00
	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
		err = -EIO;
		goto out;
	}
2020-06-01 10:10:21 -04:00
	req->r_callback = finish_netfs_read;
	req->r_priv = subreq;
	req->r_inode = inode;
	ihold(inode);
2024-07-02 00:40:22 +01:00
|
|
|
trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
|
2022-06-30 16:21:50 -04:00
|
|
|
ceph_osdc_start_request(req->r_osdc, req);
|
2020-06-01 10:10:21 -04:00
|
|
|
out:
|
|
|
|
ceph_osdc_put_request(req);
|
2024-12-16 20:40:58 +00:00
|
|
|
if (err) {
|
|
|
|
subreq->error = err;
|
2024-12-16 20:40:59 +00:00
|
|
|
netfs_read_subreq_terminated(subreq);
|
2024-12-16 20:40:58 +00:00
|
|
|
}
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx result %d\n", ceph_vinop(inode), err);
|
2020-06-01 10:10:21 -04:00
|
|
|
}
|
|
|
|
|
2022-03-09 21:45:22 +00:00
static int ceph_init_request(struct netfs_io_request *rreq, struct file *file)
{
	struct inode *inode = rreq->inode;
2024-07-02 00:40:22 +01:00
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = ceph_inode_to_client(inode);
|
2022-03-09 21:45:22 +00:00
|
|
|
int got = 0, want = CEPH_CAP_FILE_CACHE;
|
2023-05-10 19:55:46 +08:00
|
|
|
struct ceph_netfs_request_data *priv;
|
2022-03-09 21:45:22 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used
correctly. The problem is that we try to set them up in the request
initialisation, but the cache may still be in the process of setting up,
and so the state may not be correct. Further, we secondarily sample the
cache state and make contradictory decisions later.
The issue arises because we set up the cache resources, which allows the
cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which
triggers cache writing even if we didn't set the flags when allocating.
Fix this in the following way:
(1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in
->init_request() rather than trying to juggle that in
netfs_alloc_request().
(2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is
to be done, then PG_private_2 is to be used rather than only setting
it if we decide to cache and then having netfs_rreq_unlock_folios()
set the non-PG_private_2 writeback-to-cache if it wasn't set.
(3) Split netfs_rreq_unlock_folios() into two functions, one of which
contains the deprecated code for using PG_private_2 to avoid
accidentally doing the writeback path - and always use it if
USE_PGPRIV2 is set.
(4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always
wait for PG_private_2. This function is deprecated and only used by
ceph anyway, and so label it so.
(5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use
fscache_operation_valid() on the cache_resources instead. This has
the advantage of picking up the result of netfs_begin_cache_read() and
fscache_begin_write_operation() - which are called after the object is
initialised and will wait for the cache to come to a usable state.
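The idea in item (5) can be illustrated with a small userspace sketch. Everything here is hypothetical (`sketch_request`, `sketch_cache_resources` and the helper names are not kernel APIs); it only models the difference between a flag sampled once at request initialisation and a check made against the cache resources at the point of use.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's real structures. */
struct sketch_cache_ops {
	int unused;
};

struct sketch_cache_resources {
	const struct sketch_cache_ops *ops;	/* NULL until the cache is usable */
};

struct sketch_request {
	bool write_to_cache_flag;		/* old scheme: sampled once at init */
	struct sketch_cache_resources cres;
};

/* Request setup: the cache may not be ready yet when this runs. */
static void sketch_init_request(struct sketch_request *rq, bool cache_ready_now)
{
	rq->write_to_cache_flag = cache_ready_now;
	rq->cres.ops = NULL;
}

/* Models a later "begin cache operation" step that waits for the cache. */
static void sketch_begin_cache_op(struct sketch_request *rq,
				  const struct sketch_cache_ops *ops)
{
	rq->cres.ops = ops;
}

/* Old decision: trusts the value sampled during initialisation. */
static bool sketch_should_write_old(const struct sketch_request *rq)
{
	return rq->write_to_cache_flag;
}

/* New decision: asks whether the cache resources are valid right now. */
static bool sketch_should_write_new(const struct sketch_request *rq)
{
	return rq->cres.ops != NULL;
}
```

If the cache becomes usable only after initialisation, the old check keeps answering "no" while the new one reflects the current state, which is the contradiction the commit message describes.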
Just reverting ae678317b95e[1] isn't a sufficient fix, so this needs to be
applied on top of that. Without this as well, things like:
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: {
and:
WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386
may happen, along with some UAFs due to PG_private_2 not getting used to
wait on writeback completion.
Fixes: 2ff1e97587f4 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Hristo Venev <hristo@venev.name>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 19:38:46 +01:00
	/* [DEPRECATED] Use PG_private_2 to mark folio being written to the cache. */
	__set_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags);
2022-03-09 21:45:22 +00:00
	if (rreq->origin != NETFS_READAHEAD)
		return 0;
2023-05-10 19:55:46 +08:00
	priv = kzalloc(sizeof(*priv), GFP_NOFS);
	if (!priv)
		return -ENOMEM;
2022-03-09 21:45:22 +00:00
	if (file) {
		struct ceph_rw_context *rw_ctx;
		struct ceph_file_info *fi = file->private_data;
2023-05-10 19:55:46 +08:00
		priv->file_ra_pages = file->f_ra.ra_pages;
		priv->file_ra_disabled = file->f_mode & FMODE_RANDOM;
2022-03-09 21:45:22 +00:00
|
|
|
rw_ctx = ceph_find_rw_context(fi);
|
2023-05-10 19:55:46 +08:00
|
|
|
if (rw_ctx) {
|
|
|
|
rreq->netfs_priv = priv;
|
2022-03-09 21:45:22 +00:00
|
|
|
return 0;
|
2023-05-10 19:55:46 +08:00
|
|
|
}
|
2022-03-09 21:45:22 +00:00
	}

	/*
	 * readahead callers do not necessarily hold Fcb caps
	 * (e.g. fadvise, madvise).
	 */
	ret = ceph_try_get_caps(inode, CEPH_CAP_FILE_RD, want, true, &got);
	if (ret < 0) {
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode));
|
2023-05-10 19:55:46 +08:00
|
|
|
goto out;
|
2022-03-09 21:45:22 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!(got & want)) {
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx, no cache cap\n", ceph_vinop(inode));
|
2023-05-10 19:55:46 +08:00
		ret = -EACCES;
		goto out;
	}
	if (ret == 0) {
		ret = -EACCES;
		goto out;
2022-03-09 21:45:22 +00:00
|
|
|
}
|
|
|
|
|
2023-05-10 19:55:46 +08:00
|
|
|
priv->caps = got;
|
|
|
|
rreq->netfs_priv = priv;
|
2024-07-02 00:40:22 +01:00
|
|
|
rreq->io_streams[0].sreq_max_len = fsc->mount_options->rsize;
|
2023-05-10 19:55:46 +08:00
|
|
|
|
|
|
|
out:
|
2024-10-02 21:05:12 -04:00
|
|
|
if (ret < 0) {
|
|
|
|
if (got)
|
|
|
|
ceph_put_cap_refs(ceph_inode(inode), got);
|
2023-05-10 19:55:46 +08:00
|
|
|
kfree(priv);
|
2024-10-02 21:05:12 -04:00
|
|
|
}
|
2023-05-10 19:55:46 +08:00
|
|
|
|
|
|
|
return ret;
|
2022-03-09 21:45:22 +00:00
|
|
|
}
|
|
|
|
|
2022-02-25 11:19:14 +00:00
|
|
|
static void ceph_netfs_free_request(struct netfs_io_request *rreq)
|
2020-07-09 14:43:23 -04:00
|
|
|
{
|
2023-05-10 19:55:46 +08:00
	struct ceph_netfs_request_data *priv = rreq->netfs_priv;

	if (!priv)
		return;
2020-07-09 14:43:23 -04:00
|
|
|
|
2023-05-10 19:55:46 +08:00
	if (priv->caps)
		ceph_put_cap_refs(ceph_inode(rreq->inode), priv->caps);
	kfree(priv);
	rreq->netfs_priv = NULL;
2020-07-09 14:43:23 -04:00
|
|
|
}
|
|
|
|
|
2021-06-29 22:37:05 +01:00
|
|
|
const struct netfs_request_ops ceph_netfs_ops = {
|
2022-03-09 21:45:22 +00:00
|
|
|
.init_request = ceph_init_request,
|
2022-02-25 11:19:14 +00:00
|
|
|
.free_request = ceph_netfs_free_request,
|
2024-07-02 00:40:22 +01:00
|
|
|
.prepare_read = ceph_netfs_prepare_read,
|
2022-02-17 10:14:32 +00:00
|
|
|
.issue_read = ceph_netfs_issue_read,
|
2020-06-01 10:10:21 -04:00
|
|
|
.expand_readahead = ceph_netfs_expand_readahead,
|
2020-06-05 10:43:21 -04:00
|
|
|
.check_write_begin = ceph_netfs_check_write_begin,
|
2020-06-01 10:10:21 -04:00
|
|
|
};
|
|
|
|
|
2021-12-07 08:44:51 -05:00
|
|
|
#ifdef CONFIG_CEPH_FSCACHE
|
2024-07-30 17:01:40 +01:00
static void ceph_set_page_fscache(struct page *page)
{
	folio_start_private_2(page_folio(page)); /* [DEPRECATED] */
}
2025-05-19 10:07:03 +01:00
|
|
|
static void ceph_fscache_write_terminated(void *priv, ssize_t error)
|
2021-12-07 08:44:51 -05:00
{
	struct inode *inode = priv;

	if (IS_ERR_VALUE(error) && error != -ENOBUFS)
		ceph_fscache_invalidate(inode, false);
}

static void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64 len, bool caching)
{
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct fscache_cookie *cookie = ceph_fscache_cookie(ci);

	fscache_write_to_cache(cookie, inode->i_mapping, off, len, i_size_read(inode),
netfs: Replace PG_fscache by setting folio->private and marking dirty
When dirty data is being written to the cache, setting/waiting on/clearing
the fscache flag is always done in tandem with setting/waiting on/clearing
the writeback flag. The netfslib buffered write routines wait on and set
both flags and the write request cleanup clears both flags, so the fscache
flag is almost superfluous.
The reason it isn't superfluous is because the fscache flag is also used to
indicate that data just read from the server is being written to the cache.
The flag is used to prevent a race involving overlapping direct-I/O writes
to the cache.
Change this to indicate that a page is in need of being copied to the cache
by placing a magic value in folio->private and marking the folios dirty.
Then when the writeback code sees a folio marked in this way, it only
writes it to the cache and not to the server.
If a folio that has this magic value set is modified, the value is just
replaced and the folio will then be uploaded too.
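The marking scheme described above can be sketched in userspace C. This is a simplified model under stated assumptions, not the netfslib implementation: `sketch_folio`, the sentinel `SKETCH_COPY_TO_CACHE` and the helper names are all hypothetical. It shows a sentinel stored in a private pointer selecting cache-only writeback, and an ordinary modification replacing the sentinel so the data goes to the server as well.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: a private pointer and a dirty bit. */
struct sketch_folio {
	void *private;
	bool dirty;
};

/* Hypothetical sentinel: the address of a static object serves as the magic value. */
static char sketch_copy_to_cache_marker;
#define SKETCH_COPY_TO_CACHE ((void *)&sketch_copy_to_cache_marker)

enum sketch_target {
	SKETCH_NONE,			/* clean folio: nothing to write */
	SKETCH_CACHE_ONLY,		/* data just read: copy to the cache only */
	SKETCH_SERVER_AND_CACHE,	/* modified data: upload as well */
};

/* Data just read from the server: mark it as needing a copy to the cache. */
static void sketch_mark_for_cache_copy(struct sketch_folio *f)
{
	f->private = SKETCH_COPY_TO_CACHE;
	f->dirty = true;
}

/* A real modification simply replaces the magic value with write state. */
static void sketch_modify(struct sketch_folio *f, void *write_state)
{
	f->private = write_state;
	f->dirty = true;
}

/* Writeback inspects the private value to decide where the data goes. */
static enum sketch_target sketch_writeback_target(const struct sketch_folio *f)
{
	if (!f->dirty)
		return SKETCH_NONE;
	if (f->private == SKETCH_COPY_TO_CACHE)
		return SKETCH_CACHE_ONLY;
	return SKETCH_SERVER_AND_CACHE;
}
```

Because the sentinel lives in the same field that a modification overwrites, no separate flag is needed to track the transition from "cache only" to "upload too".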
With this, PG_fscache is no longer required by the netfslib core, 9p and
afs.
Ceph and nfs, however, still need to use the old PG_fscache-based tracking.
To deal with this, a flag, NETFS_ICTX_USE_PGPRIV2, now has to be set on the
flags in the netfs_inode struct for those filesystems. This reenables the
use of PG_fscache in that inode. 9p and afs use the netfslib write helpers
so get switched over; cifs, for the moment, does page-by-page manual access
to the cache, so doesn't use PG_fscache and is unaffected.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: Bharath SM <bharathsm@microsoft.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-03-19 10:00:09 +00:00
|
|
|
ceph_fscache_write_terminated, inode, true, caching);
|
2021-12-07 08:44:51 -05:00
|
|
|
}
|
|
|
|
#else
|
2024-07-30 17:01:40 +01:00
static inline void ceph_set_page_fscache(struct page *page)
{
}
2021-12-07 08:44:51 -05:00
static inline void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64 len, bool caching)
{
}
#endif /* CONFIG_CEPH_FSCACHE */
2017-08-30 11:36:06 +08:00
struct ceph_writeback_ctl
{
	loff_t i_size;
	u64 truncate_size;
	u32 truncate_seq;
	bool size_stable;
2025-02-04 16:02:46 -08:00
|
|
|
|
2017-09-01 16:53:58 +08:00
|
|
|
bool head_snapc;
|
2025-02-04 16:02:46 -08:00
	struct ceph_snap_context *snapc;
	struct ceph_snap_context *last_snapc;

	bool done;
	bool should_loop;
	bool range_whole;
	pgoff_t start_index;
	pgoff_t index;
	pgoff_t end;
	xa_mark_t tag;

	pgoff_t strip_unit_end;
	unsigned int wsize;
	unsigned int nr_folios;
	unsigned int max_pages;
	unsigned int locked_pages;

	int op_idx;
	int num_ops;
	u64 offset;
	u64 len;

	struct folio_batch fbatch;
	unsigned int processed_in_fbatch;

	bool from_pool;
	struct page **pages;
	struct page **data_pages;
2017-08-30 11:36:06 +08:00
|
|
|
};
|
|
|
|
|
2009-10-06 11:31:09 -07:00
/*
 * Get ref for the oldest snapc for an inode with dirty data... that is, the
 * only snap context we are allowed to write back.
 */
2017-08-30 11:36:06 +08:00
|
|
|
static struct ceph_snap_context *
|
2017-09-02 10:50:48 +08:00
|
|
|
get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl,
|
|
|
|
struct ceph_snap_context *page_snapc)
|
2009-10-06 11:31:09 -07:00
|
|
|
{
|
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = ceph_inode_to_client(inode);
|
2009-10-06 11:31:09 -07:00
|
|
|
struct ceph_snap_context *snapc = NULL;
|
|
|
|
struct ceph_cap_snap *capsnap = NULL;
|
|
|
|
|
2011-11-30 09:47:09 -08:00
|
|
|
spin_lock(&ci->i_ceph_lock);
|
2009-10-06 11:31:09 -07:00
|
|
|
list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, " capsnap %p snapc %p has %d dirty pages\n",
|
|
|
|
capsnap, capsnap->context, capsnap->dirty_pages);
|
2017-09-02 10:50:48 +08:00
		if (!capsnap->dirty_pages)
			continue;

		/* get i_size, truncate_{seq,size} for page_snapc? */
		if (snapc && capsnap->context != page_snapc)
			continue;

		if (ctl) {
			if (capsnap->writing) {
				ctl->i_size = i_size_read(inode);
				ctl->size_stable = false;
			} else {
				ctl->i_size = capsnap->size;
				ctl->size_stable = true;
2017-08-30 11:36:06 +08:00
|
|
|
}
|
2017-09-02 10:50:48 +08:00
|
|
|
ctl->truncate_size = capsnap->truncate_size;
|
|
|
|
ctl->truncate_seq = capsnap->truncate_seq;
|
2017-09-01 16:53:58 +08:00
|
|
|
ctl->head_snapc = false;
|
2009-10-06 11:31:09 -07:00
|
|
|
}
|
2017-09-02 10:50:48 +08:00
		if (snapc)
			break;

		snapc = ceph_get_snap_context(capsnap->context);
		if (!page_snapc ||
		    page_snapc == snapc ||
		    page_snapc->seq > snapc->seq)
			break;
2009-10-06 11:31:09 -07:00
|
|
|
}
|
2010-08-24 08:44:16 -07:00
|
|
|
if (!snapc && ci->i_wrbuffer_ref_head) {
|
2010-03-31 21:52:10 -07:00
|
|
|
snapc = ceph_get_snap_context(ci->i_head_snapc);
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, " head snapc %p has %d dirty pages\n", snapc,
|
|
|
|
ci->i_wrbuffer_ref_head);
|
2017-08-30 11:36:06 +08:00
		if (ctl) {
			ctl->i_size = i_size_read(inode);
			ctl->truncate_size = ci->i_truncate_size;
			ctl->truncate_seq = ci->i_truncate_seq;
			ctl->size_stable = false;
2017-09-01 16:53:58 +08:00
|
|
|
ctl->head_snapc = true;
|
2017-08-30 11:36:06 +08:00
|
|
|
}
|
2009-10-06 11:31:09 -07:00
|
|
|
}
|
2011-11-30 09:47:09 -08:00
|
|
|
spin_unlock(&ci->i_ceph_lock);
|
2009-10-06 11:31:09 -07:00
|
|
|
return snapc;
|
|
|
|
}
|
|
|
|
|
2017-08-30 11:36:06 +08:00
static u64 get_writepages_data_length(struct inode *inode,
				      struct page *page, u64 start)
{
	struct ceph_inode_info *ci = ceph_inode(inode);
2022-08-25 09:31:25 -04:00
|
|
|
struct ceph_snap_context *snapc;
|
2017-08-30 11:36:06 +08:00
|
|
|
struct ceph_cap_snap *capsnap = NULL;
|
|
|
|
u64 end = i_size_read(inode);
|
2022-08-25 09:31:25 -04:00
|
|
|
u64 ret;
|
2017-08-30 11:36:06 +08:00
|
|
|
|
2022-08-25 09:31:25 -04:00
|
|
|
snapc = page_snap_context(ceph_fscrypt_pagecache_page(page));
|
2017-08-30 11:36:06 +08:00
	if (snapc != ci->i_head_snapc) {
		bool found = false;
		spin_lock(&ci->i_ceph_lock);
		list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
			if (capsnap->context == snapc) {
				if (!capsnap->writing)
					end = capsnap->size;
				found = true;
				break;
			}
		}
		spin_unlock(&ci->i_ceph_lock);
		WARN_ON(!found);
	}
2022-08-25 09:31:25 -04:00
	if (end > ceph_fscrypt_page_offset(page) + thp_size(page))
		end = ceph_fscrypt_page_offset(page) + thp_size(page);
	ret = end > start ? end - start : 0;
	if (ret && fscrypt_is_bounce_page(page))
		ret = round_up(ret, CEPH_FSCRYPT_BLOCK_SIZE);
	return ret;
2017-08-30 11:36:06 +08:00
|
|
|
}
|
|
|
|
|
2009-10-06 11:31:09 -07:00
|
|
|
/*
|
2025-02-17 18:51:13 +00:00
|
|
|
* Write a folio, but leave it locked.
|
2009-10-06 11:31:09 -07:00
|
|
|
*
|
2019-07-02 12:35:52 -04:00
|
|
|
* If we get a write error, mark the mapping for error, but still adjust the
|
2025-02-17 18:51:13 +00:00
|
|
|
* dirty page accounting (i.e., folio is no longer dirty).
|
2009-10-06 11:31:09 -07:00
|
|
|
*/
|
2025-02-17 18:51:13 +00:00
|
|
|
static int write_folio_nounlock(struct folio *folio,
|
|
|
|
struct writeback_control *wbc)
|
2009-10-06 11:31:09 -07:00
|
|
|
{
|
2025-02-17 18:51:13 +00:00
|
|
|
struct page *page = &folio->page;
|
|
|
|
struct inode *inode = folio->mapping->host;
|
2020-07-14 14:37:15 -04:00
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
2023-06-12 10:50:38 +08:00
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = fsc->client;
|
2010-03-31 22:01:38 -07:00
|
|
|
struct ceph_snap_context *snapc, *oldest;
|
2025-02-17 18:51:13 +00:00
|
|
|
loff_t page_off = folio_pos(folio);
|
2020-07-14 14:37:15 -04:00
|
|
|
int err;
|
2025-02-17 18:51:13 +00:00
|
|
|
loff_t len = folio_size(folio);
|
2022-08-25 09:31:25 -04:00
|
|
|
loff_t wlen;
|
2017-08-30 11:36:06 +08:00
|
|
|
struct ceph_writeback_ctl ceph_wbc;
|
2020-07-14 14:37:15 -04:00
|
|
|
struct ceph_osd_client *osdc = &fsc->client->osdc;
|
|
|
|
struct ceph_osd_request *req;
|
2021-12-07 08:44:51 -05:00
|
|
|
bool caching = ceph_is_cache_enabled(inode);
|
2022-08-25 09:31:25 -04:00
|
|
|
struct page *bounce_page = NULL;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2025-02-17 18:51:13 +00:00
|
|
|
doutc(cl, "%llx.%llx folio %p idx %lu\n", ceph_vinop(inode), folio,
|
|
|
|
folio->index);
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2023-02-01 09:36:45 +08:00
|
|
|
if (ceph_inode_is_shutdown(inode))
|
|
|
|
return -EIO;
|
|
|
|
|
2009-10-06 11:31:09 -07:00
|
|
|
/* verify this is a writeable snap context */
|
2025-02-17 18:51:13 +00:00
|
|
|
snapc = page_snap_context(&folio->page);
|
2017-08-20 20:22:02 +02:00
|
|
|
if (!snapc) {
|
2025-02-17 18:51:13 +00:00
|
|
|
doutc(cl, "%llx.%llx folio %p not dirty?\n", ceph_vinop(inode),
|
|
|
|
folio);
|
2017-05-23 17:48:28 +08:00
|
|
|
return 0;
|
2009-10-06 11:31:09 -07:00
|
|
|
}
|
2017-09-02 10:50:48 +08:00
|
|
|
oldest = get_oldest_context(inode, &ceph_wbc, snapc);
|
2010-03-31 22:01:38 -07:00
|
|
|
if (snapc->seq > oldest->seq) {
|
2025-02-17 18:51:13 +00:00
|
|
|
doutc(cl, "%llx.%llx folio %p snapc %p not writeable - noop\n",
|
|
|
|
ceph_vinop(inode), folio, snapc);
|
2009-10-06 11:31:09 -07:00
|
|
|
/* we should only noop if called by kswapd */
|
2017-05-23 17:18:53 +08:00
|
|
|
WARN_ON(!(current->flags & PF_MEMALLOC));
|
2010-03-31 22:01:38 -07:00
|
|
|
ceph_put_snap_context(oldest);
|
2025-02-17 18:51:13 +00:00
|
|
|
folio_redirty_for_writepage(wbc, folio);
|
2017-05-23 17:48:28 +08:00
|
|
|
return 0;
|
2009-10-06 11:31:09 -07:00
|
|
|
}
|
2010-03-31 22:01:38 -07:00
|
|
|
ceph_put_snap_context(oldest);
|
2009-10-06 11:31:09 -07:00
|
|
|
|
|
|
|
/* is this a partial page at end of file? */
|
2017-08-30 11:36:06 +08:00
|
|
|
if (page_off >= ceph_wbc.i_size) {
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx folio at %lu beyond eof %llu\n",
|
|
|
|
ceph_vinop(inode), folio->index, ceph_wbc.i_size);
|
2022-02-09 20:21:30 +00:00
|
|
|
folio_invalidate(folio, 0, folio_size(folio));
|
2017-05-23 17:48:28 +08:00
|
|
|
return 0;
|
2013-05-31 16:48:29 +08:00
|
|
|
}
|
2017-05-23 17:48:28 +08:00
|
|
|
|
2017-08-30 11:36:06 +08:00
|
|
|
if (ceph_wbc.i_size < page_off + len)
|
|
|
|
len = ceph_wbc.i_size - page_off;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2022-08-25 09:31:25 -04:00
|
|
|
wlen = IS_ENCRYPTED(inode) ? round_up(len, CEPH_FSCRYPT_BLOCK_SIZE) : len;
|
2025-02-17 18:51:13 +00:00
|
|
|
doutc(cl, "%llx.%llx folio %p index %lu on %llu~%llu snapc %p seq %lld\n",
|
|
|
|
ceph_vinop(inode), folio, folio->index, page_off, wlen, snapc,
|
2023-06-12 09:04:07 +08:00
|
|
|
snapc->seq);
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2017-12-15 16:57:40 +08:00
|
|
|
if (atomic_long_inc_return(&fsc->writeback_count) >
|
2010-04-06 15:14:15 -07:00
|
|
|
CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb))
|
2022-03-22 14:39:04 -07:00
|
|
|
fsc->write_congested = true;
|
2009-12-18 13:51:57 -08:00
|
|
|
|
2022-08-25 09:31:25 -04:00
	req = ceph_osdc_new_request(osdc, &ci->i_layout, ceph_vino(inode),
				    page_off, &wlen, 0, 1, CEPH_OSD_OP_WRITE,
				    CEPH_OSD_FLAG_WRITE, snapc,
				    ceph_wbc.truncate_seq,
				    ceph_wbc.truncate_size, true);
2022-04-24 17:35:53 +08:00
|
|
|
if (IS_ERR(req)) {
|
2025-02-17 18:51:13 +00:00
|
|
|
folio_redirty_for_writepage(wbc, folio);
|
2020-07-14 14:37:15 -04:00
|
|
|
return PTR_ERR(req);
|
2022-04-24 17:35:53 +08:00
|
|
|
}
|
2021-12-07 08:44:51 -05:00
|
|
|
|
2022-08-25 09:31:25 -04:00
|
|
|
if (wlen < len)
|
|
|
|
len = wlen;
|
|
|
|
|
2025-02-17 18:51:13 +00:00
|
|
|
folio_start_writeback(folio);
|
2024-07-30 17:01:40 +01:00
|
|
|
if (caching)
|
2025-02-17 18:51:13 +00:00
|
|
|
ceph_set_page_fscache(&folio->page);
|
2021-12-07 08:44:51 -05:00
|
|
|
ceph_fscache_write_to_cache(inode, page_off, len, caching);
|
2020-07-14 14:37:15 -04:00
|
|
|
|
2022-08-25 09:31:25 -04:00
|
|
|
if (IS_ENCRYPTED(inode)) {
|
2025-03-04 17:02:23 +00:00
|
|
|
bounce_page = fscrypt_encrypt_pagecache_blocks(folio,
|
2022-08-25 09:31:25 -04:00
|
|
|
CEPH_FSCRYPT_BLOCK_SIZE, 0,
|
|
|
|
GFP_NOFS);
|
|
|
|
if (IS_ERR(bounce_page)) {
|
2025-02-17 18:51:13 +00:00
|
|
|
folio_redirty_for_writepage(wbc, folio);
|
|
|
|
folio_end_writeback(folio);
|
2022-08-25 09:31:25 -04:00
|
|
|
ceph_osdc_put_request(req);
|
|
|
|
return PTR_ERR(bounce_page);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-07-14 14:37:15 -04:00
|
|
|
/* it may be a short write due to an object boundary */
|
2025-02-17 18:51:13 +00:00
|
|
|
WARN_ON_ONCE(len > folio_size(folio));
|
2022-08-25 09:31:25 -04:00
	osd_req_op_extent_osd_data_pages(req, 0,
			bounce_page ? &bounce_page : &page, wlen, 0,
			false, false);
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx %llu~%llu (%llu bytes, %sencrypted)\n",
|
|
|
|
ceph_vinop(inode), page_off, len, wlen,
|
|
|
|
IS_ENCRYPTED(inode) ? "" : "not ");
|
2020-07-14 14:37:15 -04:00
|
|
|
|
2023-10-04 14:52:09 -04:00
|
|
|
req->r_mtime = inode_get_mtime(inode);
|
2022-06-30 16:21:50 -04:00
|
|
|
ceph_osdc_start_request(osdc, req);
|
|
|
|
err = ceph_osdc_wait_request(osdc, req);
|
2020-07-14 14:37:15 -04:00
|
|
|
|
2021-03-22 20:28:49 +08:00
|
|
|
ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
|
2021-05-13 09:40:53 +08:00
|
|
|
req->r_end_latency, len, err);
|
2022-08-25 09:31:25 -04:00
|
|
|
fscrypt_free_bounce_page(bounce_page);
|
2020-07-14 14:37:15 -04:00
|
|
|
ceph_osdc_put_request(req);
|
|
|
|
if (err == 0)
|
|
|
|
err = len;
|
|
|
|
|
2009-10-06 11:31:09 -07:00
|
|
|
if (err < 0) {
|
2016-05-13 17:29:51 +08:00
|
|
|
struct writeback_control tmp_wbc;
|
|
|
|
if (!wbc)
|
|
|
|
wbc = &tmp_wbc;
|
|
|
|
if (err == -ERESTARTSYS) {
|
|
|
|
/* killed by SIGKILL */
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx interrupted page %p\n",
|
2025-02-17 18:51:13 +00:00
|
|
|
ceph_vinop(inode), folio);
|
|
|
|
folio_redirty_for_writepage(wbc, folio);
|
|
|
|
folio_end_writeback(folio);
|
2017-05-23 17:48:28 +08:00
|
|
|
return err;
|
2016-05-13 17:29:51 +08:00
|
|
|
}
|
2020-09-14 13:39:19 +02:00
|
|
|
if (err == -EBLOCKLISTED)
|
|
|
|
fsc->blocklisted = true;
|
2025-02-17 18:51:13 +00:00
|
|
|
doutc(cl, "%llx.%llx setting mapping error %d %p\n",
|
|
|
|
ceph_vinop(inode), err, folio);
|
2009-10-06 11:31:09 -07:00
|
|
|
mapping_set_error(&inode->i_data, err);
|
2016-05-13 17:29:51 +08:00
|
|
|
wbc->pages_skipped++;
|
2009-10-06 11:31:09 -07:00
|
|
|
} else {
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx cleaned page %p\n",
|
2025-02-17 18:51:13 +00:00
|
|
|
ceph_vinop(inode), folio);
|
2009-10-06 11:31:09 -07:00
|
|
|
err = 0; /* vfs expects us to return 0 */
|
|
|
|
}
|
2025-02-17 18:51:13 +00:00
|
|
|
oldest = folio_detach_private(folio);
|
2021-03-23 15:16:52 -04:00
|
|
|
WARN_ON_ONCE(oldest != snapc);
|
2025-02-17 18:51:13 +00:00
|
|
|
folio_end_writeback(folio);
|
2009-10-06 11:31:09 -07:00
|
|
|
ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
|
2010-03-31 22:01:38 -07:00
|
|
|
ceph_put_snap_context(snapc); /* page's reference */
|
2017-12-15 16:57:40 +08:00
|
|
|
|
|
|
|
if (atomic_long_dec_return(&fsc->writeback_count) <
|
|
|
|
CONGESTION_OFF_THRESH(fsc->mount_options->congestion_kb))
|
2022-03-22 14:39:04 -07:00
|
|
|
fsc->write_congested = false;
|
2017-12-15 16:57:40 +08:00
|
|
|
|
2009-10-06 11:31:09 -07:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* async writeback completion handler.
|
|
|
|
*
|
|
|
|
* If we get an error, set the mapping error bit, but not the individual
|
|
|
|
* page error bits.
|
|
|
|
*/
|
2016-04-28 16:07:24 +02:00
|
|
|
static void writepages_finish(struct ceph_osd_request *req)
|
2009-10-06 11:31:09 -07:00
|
|
|
{
|
|
|
|
struct inode *inode = req->r_inode;
|
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = ceph_inode_to_client(inode);
|
2013-04-03 01:28:58 -05:00
|
|
|
struct ceph_osd_data *osd_data;
|
2009-10-06 11:31:09 -07:00
|
|
|
struct page *page;
|
2016-01-07 16:00:17 +08:00
|
|
|
int num_pages, total_pages = 0;
|
|
|
|
int i, j;
|
|
|
|
int rc = req->r_result;
|
2009-10-06 11:31:09 -07:00
|
|
|
struct ceph_snap_context *snapc = req->r_snapc;
|
|
|
|
struct address_space *mapping = inode->i_mapping;
|
2023-06-12 10:50:38 +08:00
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
2025-02-04 16:02:49 -08:00
|
|
|
struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
|
2021-05-13 09:40:53 +08:00
|
|
|
unsigned int len = 0;
|
2016-01-07 16:00:17 +08:00
|
|
|
bool remove_page;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx rc %d\n", ceph_vinop(inode), rc);
|
2017-04-04 08:39:46 -04:00
|
|
|
if (rc < 0) {
|
2009-10-06 11:31:09 -07:00
|
|
|
mapping_set_error(mapping, rc);
|
2017-04-04 08:39:46 -04:00
|
|
|
ceph_set_error_write(ci);
|
2020-09-14 13:39:19 +02:00
|
|
|
if (rc == -EBLOCKLISTED)
|
|
|
|
fsc->blocklisted = true;
|
2017-04-04 08:39:46 -04:00
|
|
|
} else {
|
|
|
|
ceph_clear_error_write(ci);
|
|
|
|
}
|
2016-01-07 16:00:17 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We lost the cache cap, need to truncate the page before
|
|
|
|
* it is unlocked, otherwise we'd truncate it later in the
|
|
|
|
* page truncation thread, possibly losing some data that
|
|
|
|
* raced its way in
|
|
|
|
*/
|
|
|
|
remove_page = !(ceph_caps_issued(ci) &
|
|
|
|
(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO));
|
2009-10-06 11:31:09 -07:00
|
|
|
|
|
|
|
/* clean all pages */
|
2016-01-07 16:00:17 +08:00
|
|
|
for (i = 0; i < req->r_num_ops; i++) {
|
2022-05-05 18:53:09 +08:00
|
|
|
if (req->r_ops[i].op != CEPH_OSD_OP_WRITE) {
|
2023-06-12 09:04:07 +08:00
|
|
|
pr_warn_client(cl,
|
|
|
|
"%llx.%llx incorrect op %d req %p index %d tid %llu\n",
|
|
|
|
ceph_vinop(inode), req->r_ops[i].op, req, i,
|
|
|
|
req->r_tid);
|
2016-01-07 16:00:17 +08:00
|
|
|
break;
|
2022-05-05 18:53:09 +08:00
|
|
|
}
|
2010-02-19 00:07:01 +00:00
|
|
|
|
2016-01-07 16:00:17 +08:00
|
|
|
osd_data = osd_req_op_extent_osd_data(req, i);
|
|
|
|
BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
|
2021-05-13 09:40:53 +08:00
|
|
|
len += osd_data->length;
|
2016-01-07 16:00:17 +08:00
|
|
|
num_pages = calc_pages_for((u64)osd_data->alignment,
|
|
|
|
(u64)osd_data->length);
|
|
|
|
total_pages += num_pages;
|
|
|
|
for (j = 0; j < num_pages; j++) {
|
|
|
|
page = osd_data->pages[j];
|
2022-08-25 09:31:25 -04:00
|
|
|
if (fscrypt_is_bounce_page(page)) {
|
|
|
|
page = fscrypt_pagecache_page(page);
|
|
|
|
fscrypt_free_bounce_page(osd_data->pages[j]);
|
|
|
|
osd_data->pages[j] = page;
|
|
|
|
}
|
2016-01-07 16:00:17 +08:00
|
|
|
BUG_ON(!page);
|
|
|
|
WARN_ON(!PageUptodate(page));
|
|
|
|
|
|
|
|
if (atomic_long_dec_return(&fsc->writeback_count) <
|
|
|
|
CONGESTION_OFF_THRESH(
|
|
|
|
fsc->mount_options->congestion_kb))
|
2022-03-22 14:39:04 -07:00
|
|
|
fsc->write_congested = false;
|
2016-01-07 16:00:17 +08:00
|
|
|
|
2021-03-23 15:16:52 -04:00
|
|
|
ceph_put_snap_context(detach_page_private(page));
|
2016-01-07 16:00:17 +08:00
|
|
|
end_page_writeback(page);
|
2025-02-04 16:02:49 -08:00
|
|
|
|
|
|
|
if (atomic64_dec_return(&mdsc->dirty_folios) <= 0) {
|
|
|
|
wake_up_all(&mdsc->flush_end_wq);
|
|
|
|
WARN_ON(atomic64_read(&mdsc->dirty_folios) < 0);
|
|
|
|
}
|
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "unlocking %p\n", page);
|
2016-01-07 16:00:17 +08:00
|
|
|
|
|
|
|
if (remove_page)
|
2023-11-17 16:14:47 +00:00
|
|
|
generic_error_remove_folio(inode->i_mapping,
|
|
|
|
page_folio(page));
|
2016-01-07 16:00:17 +08:00
|
|
|
|
|
|
|
unlock_page(page);
|
|
|
|
}
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx wrote %llu bytes cleaned %d pages\n",
|
|
|
|
ceph_vinop(inode), osd_data->length,
|
|
|
|
rc >= 0 ? num_pages : 0);
|
2010-02-19 00:07:01 +00:00
|
|
|
|
2019-08-08 20:56:47 -07:00
|
|
|
release_pages(osd_data->pages, num_pages);
|
2009-10-06 11:31:09 -07:00
|
|
|
}
|
|
|
|
|
2021-05-13 09:40:53 +08:00
|
|
|
ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
|
|
|
|
req->r_end_latency, len, rc);
|
|
|
|
|
2016-01-07 16:00:17 +08:00
|
|
|
ceph_put_wrbuffer_cap_refs(ci, total_pages, snapc);
|
|
|
|
|
|
|
|
osd_data = osd_req_op_extent_osd_data(req, 0);
|
2013-04-03 01:28:58 -05:00
|
|
|
if (osd_data->pages_from_pool)
|
2020-07-30 11:03:55 -04:00
|
|
|
mempool_free(osd_data->pages, ceph_wb_pagevec_pool);
|
2009-10-06 11:31:09 -07:00
|
|
|
else
|
2013-04-03 01:28:58 -05:00
|
|
|
kfree(osd_data->pages);
|
2009-10-06 11:31:09 -07:00
|
|
|
ceph_osdc_put_request(req);
|
2023-05-09 16:38:49 +08:00
|
|
|
ceph_dec_osd_stopping_blocker(fsc->mdsc);
|
2009-10-06 11:31:09 -07:00
|
|
|
}
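The per-op page count used in writepages_finish() comes from calc_pages_for(alignment, length). The arithmetic can be sketched in userspace as below, assuming 4 KiB pages. The names toy_calc_pages_for and TOY_PAGE_SHIFT are hypothetical, and the formula is an assumption about how the helper behaves, not a copy of the kernel header.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  ((uint64_t)1 << TOY_PAGE_SHIFT)

/*
 * Number of pages touched by a byte range that starts at byte offset
 * `off` and is `len` bytes long: index one past the last touched page
 * minus the index of the first touched page.
 */
static int toy_calc_pages_for(uint64_t off, uint64_t len)
{
	/* sketch only: keep the sum below from overflowing */
	assert(off < (UINT64_MAX >> 1));
	assert(len < (UINT64_MAX >> 1) - TOY_PAGE_SIZE);

	return (int)(((off + len + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT) -
		     (off >> TOY_PAGE_SHIFT));
}
```

An unaligned start can add a page: a 4096-byte extent starting at offset 100 spans two pages, while the same extent at offset 0 spans one.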
|
|
|
|
|
ceph: introduce ceph_process_folio_batch() method
The first step of the ceph_writepages_start() logic is
finding the dirty memory folios and processing them.
This patch introduces the ceph_process_folio_batch()
method, which moves this logic into a dedicated method.
The ceph_writepages_start() has this logic:
if (ceph_wbc.locked_pages == 0)
lock_page(page); /* first page */
else if (!trylock_page(page))
break;
<skipped>
if (folio_test_writeback(folio) ||
folio_test_private_2(folio) /* [DEPRECATED] */) {
if (wbc->sync_mode == WB_SYNC_NONE) {
doutc(cl, "%p under writeback\n", folio);
folio_unlock(folio);
continue;
}
doutc(cl, "waiting on writeback %p\n", folio);
folio_wait_writeback(folio);
folio_wait_private_2(folio); /* [DEPRECATED] */
}
The problem is that the folio/page is locked first
and is only marked by set_page_writeback(page) later, just before
the write request is submitted. The folio/page is unlocked
by writepages_finish() after the write request
finishes. This means the folio_test_writeback() check
and folio_wait_writeback() never take effect, because the page is locked
and cannot be locked again until the write request completes.
However, for the majority of folios/pages trylock_page()
is used. As a result, multiple threads can try to lock the same
folios/pages multiple times even if they are already under
writeback. This makes the logic more compute-intensive than
necessary.
This patch changes this logic:
if (folio_test_writeback(folio) ||
folio_test_private_2(folio) /* [DEPRECATED] */) {
if (wbc->sync_mode == WB_SYNC_NONE) {
doutc(cl, "%p under writeback\n", folio);
folio_unlock(folio);
continue;
}
doutc(cl, "waiting on writeback %p\n", folio);
folio_wait_writeback(folio);
folio_wait_private_2(folio); /* [DEPRECATED] */
}
if (ceph_wbc.locked_pages == 0)
lock_page(page); /* first page */
else if (!trylock_page(page))
break;
This ordering should ensure that the writeback
state of folios/pages is no longer ignored.
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-3-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
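The reordering described in the message above (test the writeback state first, attempt the lock second) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct toy_folio, toy_trylock() and toy_collect() are hypothetical stand-ins, only the WB_SYNC_NONE-style "skip" path is modelled, and nothing here is kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: only the two bits discussed above. */
struct toy_folio {
	bool locked;
	bool writeback;
};

/* Non-blocking lock attempt, loosely analogous to trylock_page(). */
static bool toy_trylock(struct toy_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

/*
 * Walk the folios, skipping any that are under writeback before a lock
 * attempt is made. Returns how many folios were locked and reports the
 * number of lock attempts through *lock_attempts.
 */
static size_t toy_collect(struct toy_folio *folios, size_t n,
			  size_t *lock_attempts)
{
	size_t picked = 0;

	assert(folios && lock_attempts);
	*lock_attempts = 0;
	for (size_t i = 0; i < n; i++) {
		if (folios[i].writeback)
			continue;	/* skipped without touching the lock */
		(*lock_attempts)++;
		if (!toy_trylock(&folios[i]))
			break;		/* same shape as "else if (!trylock_page(page)) break;" */
		picked++;
	}
	return picked;
}
```

With the check placed first, a folio that is both locked and under writeback costs no lock attempt at all, which is the saving the message argues for.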
2025-02-04 16:02:47 -08:00
|
|
|
static inline
|
|
|
|
bool is_forced_umount(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
struct inode *inode = mapping->host;
|
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
|
|
|
struct ceph_client *cl = fsc->client;
|
|
|
|
|
|
|
|
if (ceph_inode_is_shutdown(inode)) {
|
|
|
|
if (ci->i_wrbuffer_ref > 0) {
|
|
|
|
pr_warn_ratelimited_client(cl,
|
|
|
|
"%llx.%llx %lld forced umount\n",
|
|
|
|
ceph_vinop(inode), ceph_ino(inode));
|
|
|
|
}
|
|
|
|
mapping_set_error(mapping, -EIO);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2025-02-04 16:02:46 -08:00
|
|
|
static inline
|
|
|
|
unsigned int ceph_define_write_size(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
struct inode *inode = mapping->host;
|
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
|
|
|
unsigned int wsize = i_blocksize(inode);
|
|
|
|
|
|
|
|
if (fsc->mount_options->wsize < wsize)
|
|
|
|
wsize = fsc->mount_options->wsize;
|
|
|
|
|
|
|
|
return wsize;
|
|
|
|
}
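The write-size choice in ceph_define_write_size() is a clamp: start from the inode block size and lower it to the mount option when that is smaller. A minimal sketch with plain integers follows; toy_define_write_size and its parameters are hypothetical names that take the place of i_blocksize(inode) and fsc->mount_options->wsize.

```c
#include <assert.h>

/*
 * Return the smaller of the block size and the mount-time write size.
 * Both inputs are assumed to be non-zero byte counts.
 */
static unsigned int toy_define_write_size(unsigned int blocksize,
					  unsigned int mount_wsize)
{
	unsigned int wsize = blocksize;

	assert(blocksize > 0 && mount_wsize > 0);
	if (mount_wsize < wsize)
		wsize = mount_wsize;
	return wsize;
}
```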
|
|
|
|
|
|
|
|
static inline
|
|
|
|
void ceph_folio_batch_init(struct ceph_writeback_ctl *ceph_wbc)
|
|
|
|
{
|
|
|
|
folio_batch_init(&ceph_wbc->fbatch);
|
|
|
|
ceph_wbc->processed_in_fbatch = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline
|
|
|
|
void ceph_folio_batch_reinit(struct ceph_writeback_ctl *ceph_wbc)
|
|
|
|
{
|
|
|
|
folio_batch_release(&ceph_wbc->fbatch);
|
|
|
|
ceph_folio_batch_init(ceph_wbc);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline
|
|
|
|
void ceph_init_writeback_ctl(struct address_space *mapping,
|
|
|
|
struct writeback_control *wbc,
|
|
|
|
struct ceph_writeback_ctl *ceph_wbc)
|
|
|
|
{
|
|
|
|
ceph_wbc->snapc = NULL;
|
|
|
|
ceph_wbc->last_snapc = NULL;
|
|
|
|
|
|
|
|
ceph_wbc->strip_unit_end = 0;
|
|
|
|
ceph_wbc->wsize = ceph_define_write_size(mapping);
|
|
|
|
|
|
|
|
ceph_wbc->nr_folios = 0;
|
|
|
|
ceph_wbc->max_pages = 0;
|
|
|
|
ceph_wbc->locked_pages = 0;
|
|
|
|
|
|
|
|
ceph_wbc->done = false;
|
|
|
|
ceph_wbc->should_loop = false;
|
|
|
|
ceph_wbc->range_whole = false;
|
|
|
|
|
|
|
|
ceph_wbc->start_index = wbc->range_cyclic ? mapping->writeback_index : 0;
|
|
|
|
ceph_wbc->index = ceph_wbc->start_index;
|
|
|
|
ceph_wbc->end = -1;
|
|
|
|
|
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) {
|
|
|
|
ceph_wbc->tag = PAGECACHE_TAG_TOWRITE;
|
|
|
|
} else {
|
|
|
|
ceph_wbc->tag = PAGECACHE_TAG_DIRTY;
|
|
|
|
}
|
|
|
|
|
|
|
|
ceph_wbc->op_idx = -1;
|
|
|
|
ceph_wbc->num_ops = 0;
|
|
|
|
ceph_wbc->offset = 0;
|
|
|
|
ceph_wbc->len = 0;
|
|
|
|
ceph_wbc->from_pool = false;
|
|
|
|
|
|
|
|
ceph_folio_batch_init(ceph_wbc);
|
|
|
|
|
|
|
|
ceph_wbc->pages = NULL;
|
|
|
|
ceph_wbc->data_pages = NULL;
|
|
|
|
}
|
|
|
|
|
2025-02-04 16:02:47 -08:00
|
|
|
static inline
|
|
|
|
int ceph_define_writeback_range(struct address_space *mapping,
|
|
|
|
struct writeback_control *wbc,
|
|
|
|
struct ceph_writeback_ctl *ceph_wbc)
|
|
|
|
{
|
|
|
|
struct inode *inode = mapping->host;
|
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
|
|
|
struct ceph_client *cl = fsc->client;
|
|
|
|
|
|
|
|
/* find oldest snap context with dirty data */
|
|
|
|
ceph_wbc->snapc = get_oldest_context(inode, ceph_wbc, NULL);
|
|
|
|
if (!ceph_wbc->snapc) {
|
|
|
|
/* hmm, why does writepages get called when there
|
|
|
|
is no dirty data? */
|
|
|
|
doutc(cl, " no snap context with dirty data?\n");
|
|
|
|
return -ENODATA;
|
|
|
|
}
|
|
|
|
|
|
|
|
doutc(cl, " oldest snapc is %p seq %lld (%d snaps)\n",
|
|
|
|
ceph_wbc->snapc, ceph_wbc->snapc->seq,
|
|
|
|
ceph_wbc->snapc->num_snaps);
|
|
|
|
|
|
|
|
ceph_wbc->should_loop = false;
|
|
|
|
|
|
|
|
if (ceph_wbc->head_snapc && ceph_wbc->snapc != ceph_wbc->last_snapc) {
|
|
|
|
/* where to start/end? */
|
|
|
|
if (wbc->range_cyclic) {
|
|
|
|
ceph_wbc->index = ceph_wbc->start_index;
|
|
|
|
ceph_wbc->end = -1;
|
|
|
|
if (ceph_wbc->index > 0)
|
|
|
|
ceph_wbc->should_loop = true;
|
|
|
|
doutc(cl, " cyclic, start at %lu\n", ceph_wbc->index);
|
|
|
|
} else {
|
|
|
|
ceph_wbc->index = wbc->range_start >> PAGE_SHIFT;
|
|
|
|
ceph_wbc->end = wbc->range_end >> PAGE_SHIFT;
|
|
|
|
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
|
|
|
|
ceph_wbc->range_whole = true;
|
|
|
|
doutc(cl, " not cyclic, %lu to %lu\n",
|
|
|
|
ceph_wbc->index, ceph_wbc->end);
|
|
|
|
}
|
|
|
|
} else if (!ceph_wbc->head_snapc) {
|
|
|
|
/* Do not respect wbc->range_{start,end}. Dirty pages
|
|
|
|
* in that range can be associated with newer snapc.
|
|
|
|
* They are not writeable until all dirty pages
|
|
|
|
* associated with 'snapc' get written */
|
|
|
|
if (ceph_wbc->index > 0)
|
|
|
|
ceph_wbc->should_loop = true;
|
|
|
|
doutc(cl, " non-head snapc, range whole\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
ceph_put_snap_context(ceph_wbc->last_snapc);
|
|
|
|
ceph_wbc->last_snapc = ceph_wbc->snapc;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
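The head-snapc branch of ceph_define_writeback_range() turns the byte range in the writeback control into page indices. The sketch below models only that branch in userspace, assuming 4 KiB pages. struct toy_range and toy_define_range() are hypothetical names, and the non-head-snapc path and the snap context bookkeeping are deliberately left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct toy_range {
	unsigned long index;	/* first page index to scan */
	unsigned long end;	/* last page index, ULONG_MAX means "no limit" */
	bool should_loop;	/* wrap around to index 0 after the first pass */
	bool range_whole;	/* caller asked for the entire file */
};

static struct toy_range toy_define_range(bool range_cyclic,
					 unsigned long start_index,
					 long long range_start,
					 long long range_end)
{
	struct toy_range r = { 0, ULONG_MAX, false, false };

	assert(range_start >= 0 && range_end >= range_start);
	if (range_cyclic) {
		/* resume where the previous pass stopped */
		r.index = start_index;
		r.should_loop = start_index > 0;
	} else {
		/* convert byte offsets into page indices */
		r.index = (unsigned long)(range_start >> TOY_PAGE_SHIFT);
		r.end = (unsigned long)(range_end >> TOY_PAGE_SHIFT);
		r.range_whole = range_start == 0 && range_end == LLONG_MAX;
	}
	return r;
}
```

For example, a byte range of 8192 to 20479 maps to page indices 2 through 4.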
|
|
|
|
|
|
|
|
static inline
|
|
|
|
bool has_writeback_done(struct ceph_writeback_ctl *ceph_wbc)
|
|
|
|
{
|
|
|
|
return ceph_wbc->done && ceph_wbc->index > ceph_wbc->end;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline
|
|
|
|
bool can_next_page_be_processed(struct ceph_writeback_ctl *ceph_wbc,
|
|
|
|
unsigned index)
|
|
|
|
{
|
|
|
|
return index < ceph_wbc->nr_folios &&
|
|
|
|
ceph_wbc->locked_pages < ceph_wbc->max_pages;
|
|
|
|
}
|
|
|
|
|
|
|
|
static
|
|
|
|
int ceph_check_page_before_write(struct address_space *mapping,
|
|
|
|
struct writeback_control *wbc,
|
|
|
|
struct ceph_writeback_ctl *ceph_wbc,
|
|
|
|
struct folio *folio)
|
|
|
|
{
|
|
|
|
struct inode *inode = mapping->host;
|
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
|
|
|
struct ceph_client *cl = fsc->client;
|
|
|
|
struct ceph_snap_context *pgsnapc;
|
|
|
|
|
2025-02-17 18:51:14 +00:00
|
|
|
/* only dirty folios, or our accounting breaks */
|
|
|
|
if (unlikely(!folio_test_dirty(folio) || folio->mapping != mapping)) {
|
|
|
|
doutc(cl, "!dirty or !mapping %p\n", folio);
|
2025-02-04 16:02:47 -08:00
|
|
|
return -ENODATA;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* only if matching snap context */
|
2025-02-17 18:51:14 +00:00
|
|
|
pgsnapc = page_snap_context(&folio->page);
|
2025-02-04 16:02:47 -08:00
|
|
|
if (pgsnapc != ceph_wbc->snapc) {
|
2025-02-17 18:51:14 +00:00
|
|
|
doutc(cl, "folio snapc %p %lld != oldest %p %lld\n",
|
2025-02-04 16:02:47 -08:00
|
|
|
pgsnapc, pgsnapc->seq,
|
|
|
|
ceph_wbc->snapc, ceph_wbc->snapc->seq);
|
|
|
|
|
|
|
|
if (!ceph_wbc->should_loop && !ceph_wbc->head_snapc &&
|
|
|
|
wbc->sync_mode != WB_SYNC_NONE)
|
|
|
|
ceph_wbc->should_loop = true;
|
|
|
|
|
|
|
|
return -ENODATA;
|
|
|
|
}
|
|
|
|
|
2025-02-17 18:51:14 +00:00
|
|
|
if (folio_pos(folio) >= ceph_wbc->i_size) {
|
2025-02-04 16:02:47 -08:00
|
|
|
doutc(cl, "folio at %lu beyond eof %llu\n",
|
|
|
|
folio->index, ceph_wbc->i_size);
|
|
|
|
|
|
|
|
if ((ceph_wbc->size_stable ||
|
|
|
|
folio_pos(folio) >= i_size_read(inode)) &&
|
|
|
|
folio_clear_dirty_for_io(folio))
|
|
|
|
folio_invalidate(folio, 0, folio_size(folio));
|
|
|
|
|
|
|
|
return -ENODATA;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ceph_wbc->strip_unit_end &&
|
2025-02-17 18:51:14 +00:00
|
|
|
(folio->index > ceph_wbc->strip_unit_end)) {
|
|
|
|
doutc(cl, "end of strip unit %p\n", folio);
|
2025-02-04 16:02:47 -08:00
|
|
|
return -E2BIG;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline
|
|
|
|
void __ceph_allocate_page_array(struct ceph_writeback_ctl *ceph_wbc,
|
|
|
|
unsigned int max_pages)
|
|
|
|
{
|
|
|
|
ceph_wbc->pages = kmalloc_array(max_pages,
|
|
|
|
sizeof(*ceph_wbc->pages),
|
|
|
|
GFP_NOFS);
|
|
|
|
if (!ceph_wbc->pages) {
|
|
|
|
ceph_wbc->from_pool = true;
|
|
|
|
ceph_wbc->pages = mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS);
|
|
|
|
BUG_ON(!ceph_wbc->pages);
|
|
|
|
}
|
|
|
|
}
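__ceph_allocate_page_array() tries an ordinary allocation first and falls back to a reserved pool, recording the source in from_pool so that the release path (compare the pages_from_pool test in writepages_finish()) can hand the array back to the right place. A userspace sketch of that pattern follows. Everything named toy_* is hypothetical: a static array plays the role of the mempool, and the allocator is passed in as a function pointer purely so the failure path can be exercised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_POOL_SLOTS 16

/* Hypothetical reserve, standing in for a mempool of page-pointer arrays. */
static void *toy_pool[TOY_POOL_SLOTS];
static bool toy_pool_busy;

struct toy_vec {
	void **pages;
	bool from_pool;	/* remembers which release path to use */
};

/* Allocator that always fails, used to force the fallback. */
static void *toy_failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

static void toy_vec_alloc(struct toy_vec *v, size_t n,
			  void *(*alloc_fn)(size_t))
{
	v->from_pool = false;
	v->pages = NULL;
	if (n <= SIZE_MAX / sizeof(*v->pages))
		v->pages = alloc_fn(n * sizeof(*v->pages));
	if (!v->pages) {
		/* sketch only: one reserve, large enough by assumption */
		assert(!toy_pool_busy && n <= TOY_POOL_SLOTS);
		toy_pool_busy = true;
		v->from_pool = true;
		v->pages = toy_pool;
	}
}

static void toy_vec_free(struct toy_vec *v)
{
	if (v->from_pool)
		toy_pool_busy = false;	/* give the reserve back */
	else
		free(v->pages);
	v->pages = NULL;
}
```

The flag matters because freeing a pool-backed array with the ordinary release function (or the reverse) would corrupt one of the two allocators.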
|
|
|
|
|
|
|
|
static inline
|
|
|
|
void ceph_allocate_page_array(struct address_space *mapping,
|
|
|
|
struct ceph_writeback_ctl *ceph_wbc,
|
2025-02-17 18:51:17 +00:00
|
|
|
struct folio *folio)
|
2025-02-04 16:02:47 -08:00
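The reordering described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct toy_folio`, `toy_trylock()` and `toy_process_batch()` are hypothetical names that do not exist in the kernel, the model uses a trylock for every folio (the real code blocks on the first one), and it skips folios under writeback instead of waiting on them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: only the two bits discussed above. */
struct toy_folio {
	bool writeback;	/* already under writeback */
	bool locked;	/* lock currently held by someone */
};

/* Hypothetical trylock: succeeds only when nobody holds the lock. */
static bool toy_trylock(struct toy_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

/*
 * Walk a batch in the order the patch describes: test the writeback
 * bit first, and only then try to take the lock.
 * Returns the number of lock attempts; *taken gets the folios locked.
 */
static size_t toy_process_batch(struct toy_folio *batch, size_t n,
				size_t *taken)
{
	size_t attempts = 0;

	*taken = 0;
	for (size_t i = 0; i < n; i++) {
		if (batch[i].writeback)
			continue;	/* skipped without touching the lock */
		attempts++;
		if (!toy_trylock(&batch[i]))
			break;		/* same early exit as the trylock branch */
		(*taken)++;
	}
	return attempts;
}
```

In this model a folio that is already under writeback never costs a lock attempt, which is the saving the commit message argues for.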
|
|
|
{
|
|
|
|
struct inode *inode = mapping->host;
|
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
|
|
|
u64 objnum;
|
|
|
|
u64 objoff;
|
|
|
|
u32 xlen;
|
|
|
|
|
|
|
|
/* prepare async write request */
|
2025-02-17 18:51:17 +00:00
|
|
|
ceph_wbc->offset = (u64)folio_pos(folio);
|
|
|
|
ceph_calc_file_object_mapping(&ci->i_layout,
|
|
|
|
ceph_wbc->offset, ceph_wbc->wsize,
|
|
|
|
&objnum, &objoff, &xlen);
|
|
|
|
|
|
|
|
ceph_wbc->num_ops = 1;
|
2025-02-17 18:51:17 +00:00
|
|
|
ceph_wbc->strip_unit_end = folio->index + ((xlen - 1) >> PAGE_SHIFT);
|
|
|
|
|
|
|
|
BUG_ON(ceph_wbc->pages);
|
|
|
|
ceph_wbc->max_pages = calc_pages_for(0, (u64)xlen);
|
|
|
|
__ceph_allocate_page_array(ceph_wbc, ceph_wbc->max_pages);
|
|
|
|
|
|
|
|
ceph_wbc->len = 0;
|
|
|
|
}
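The sizing arithmetic in ceph_allocate_page_array() can be checked in isolation. The sketch below is a userspace approximation, not the kernel code: the `toy_` names are hypothetical, a 4 KiB page size is assumed, and `toy_calc_pages_for()` reproduces what calc_pages_for() is understood to compute (the number of pages touched by a byte range).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)

/* Number of pages touched by the byte range [off, off + len). */
static uint64_t toy_calc_pages_for(uint64_t off, uint64_t len)
{
	assert(len > 0);
	return ((off + len + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT) -
	       (off >> TOY_PAGE_SHIFT);
}

/* Index of the last page of an extent of xlen bytes starting at first_index. */
static uint64_t toy_strip_unit_end(uint64_t first_index, uint32_t xlen)
{
	assert(xlen > 0);	/* xlen - 1 would wrap around for xlen == 0 */
	return first_index + (((uint64_t)xlen - 1) >> TOY_PAGE_SHIFT);
}
```

For a 4 MiB extent starting at page index 10, this gives 1024 pages and a last page index of 1033, so the page array holds exactly the pages from the first folio up to and including the strip unit end.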
|
|
|
|
|
|
|
|
static inline
|
2025-02-17 18:51:15 +00:00
|
|
|
bool is_folio_index_contiguous(const struct ceph_writeback_ctl *ceph_wbc,
|
|
|
|
const struct folio *folio)
|
|
|
|
{
|
2025-02-17 18:51:15 +00:00
|
|
|
return folio->index == (ceph_wbc->offset + ceph_wbc->len) >> PAGE_SHIFT;
|
|
|
|
}
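The contiguity test in is_folio_index_contiguous() relies on `>>` binding tighter than `==`, so the right-hand side is the page index that immediately follows the bytes gathered so far. A minimal userspace sketch, assuming 4 KiB pages and assuming `len` counts the bytes already collected for the request (`toy_is_index_contiguous()` is a hypothetical name):

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* True when index is the page right after the range [offset, offset + len). */
static bool toy_is_index_contiguous(uint64_t offset, uint64_t len,
				    uint64_t index)
{
	/* Parsed as index == ((offset + len) >> TOY_PAGE_SHIFT). */
	return index == (offset + len) >> TOY_PAGE_SHIFT;
}
```

With `offset` at 8192 and one page (4096 bytes) already gathered, only page index 3 counts as contiguous; any other index would end the current run.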
|
|
|
|
|
|
|
|
static inline
|
|
|
|
bool is_num_ops_too_big(struct ceph_writeback_ctl *ceph_wbc)
|
|
|
|
{
|
|
|
|
return ceph_wbc->num_ops >=
|
|
|
|
(ceph_wbc->from_pool ? CEPH_OSD_SLAB_OPS : CEPH_OSD_MAX_OPS);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline
|
|
|
|
bool is_write_congestion_happened(struct ceph_fs_client *fsc)
|
|
|
|
{
|
|
|
|
return atomic_long_inc_return(&fsc->writeback_count) >
|
|
|
|
CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb);
|
|
|
|
}
|
|
|
|
|
2025-02-17 18:51:16 +00:00
|
|
|
static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
|
|
|
|
struct writeback_control *wbc,
|
|
|
|
struct ceph_writeback_ctl *ceph_wbc, struct folio *folio)
|
|
|
|
{
|
|
|
|
struct inode *inode = mapping->host;
|
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
|
|
|
struct ceph_client *cl = fsc->client;
|
|
|
|
struct page **pages = ceph_wbc->pages;
|
|
|
|
unsigned int index = ceph_wbc->locked_pages;
|
|
|
|
gfp_t gfp_flags = ceph_wbc->locked_pages ? GFP_NOWAIT : GFP_NOFS;
|
|
|
|
|
|
|
|
if (IS_ENCRYPTED(inode)) {
|
2025-03-04 17:02:23 +00:00
|
|
|
pages[index] = fscrypt_encrypt_pagecache_blocks(folio,
|
|
|
|
PAGE_SIZE,
|
|
|
|
0,
|
|
|
|
gfp_flags);
|
|
|
|
if (IS_ERR(pages[index])) {
|
|
|
|
if (PTR_ERR(pages[index]) == -EINVAL) {
|
|
|
|
pr_err_client(cl, "inode->i_blkbits=%hhu\n",
|
|
|
|
inode->i_blkbits);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* better not fail on first page! */
|
|
|
|
BUG_ON(ceph_wbc->locked_pages == 0);
|
|
|
|
|
|
|
|
/* save the error first: PTR_ERR() of the cleared slot would be 0 */
int err = PTR_ERR(pages[index]);
pages[index] = NULL;
return err;
|
|
|
|
}
|
|
|
|
} else {
|
2025-02-17 18:51:16 +00:00
|
|
|
pages[index] = &folio->page;
|
|
|
|
}
|
|
|
|
|
|
|
|
ceph_wbc->locked_pages++;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static
|
|
|
|
int ceph_process_folio_batch(struct address_space *mapping,
|
|
|
|
struct writeback_control *wbc,
|
|
|
|
struct ceph_writeback_ctl *ceph_wbc)
|
|
|
|
{
|
|
|
|
struct inode *inode = mapping->host;
|
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
|
|
|
struct ceph_client *cl = fsc->client;
|
|
|
|
struct folio *folio = NULL;
|
|
|
|
unsigned i;
|
|
|
|
int rc = 0;
|
|
|
|
|
|
|
|
for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
|
|
|
|
folio = ceph_wbc->fbatch.folios[i];
|
|
|
|
|
|
|
|
if (!folio)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
doutc(cl, "? %p idx %lu, folio_test_writeback %#x, "
|
|
|
|
"folio_test_dirty %#x, folio_test_locked %#x\n",
|
2025-02-17 18:51:15 +00:00
|
|
|
folio, folio->index, folio_test_writeback(folio),
|
|
|
|
folio_test_dirty(folio),
|
|
|
|
folio_test_locked(folio));
|
|
|
|
|
|
|
|
if (folio_test_writeback(folio) ||
|
|
|
|
folio_test_private_2(folio) /* [DEPRECATED] */) {
|
|
|
|
doutc(cl, "waiting on writeback %p\n", folio);
|
|
|
|
folio_wait_writeback(folio);
|
|
|
|
folio_wait_private_2(folio); /* [DEPRECATED] */
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ceph_wbc->locked_pages == 0)
|
2025-02-17 18:51:15 +00:00
|
|
|
folio_lock(folio);
|
|
|
|
else if (!folio_trylock(folio))
|
|
|
|
break;
|
|
|
|
|
|
|
|
rc = ceph_check_page_before_write(mapping, wbc,
|
|
|
|
ceph_wbc, folio);
|
|
|
|
if (rc == -ENODATA) {
|
|
|
|
rc = 0;
|
2025-02-17 18:51:15 +00:00
|
|
|
folio_unlock(folio);
|
|
|
|
ceph_wbc->fbatch.folios[i] = NULL;
|
|
|
|
continue;
|
|
|
|
} else if (rc == -E2BIG) {
|
|
|
|
rc = 0;
|
2025-02-17 18:51:15 +00:00
|
|
|
folio_unlock(folio);
|
|
|
|
ceph_wbc->fbatch.folios[i] = NULL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2025-02-17 18:51:15 +00:00
|
|
|
if (!folio_clear_dirty_for_io(folio)) {
|
|
|
|
doutc(cl, "%p !folio_clear_dirty_for_io\n", folio);
|
|
|
|
folio_unlock(folio);
|
|
|
|
ceph_wbc->fbatch.folios[i] = NULL;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We have something to write. If this is
|
|
|
|
* the first locked page this time through,
|
|
|
|
* calculate max possible write size and
|
|
|
|
* allocate a page array
|
|
|
|
*/
|
|
|
|
if (ceph_wbc->locked_pages == 0) {
|
2025-02-17 18:51:17 +00:00
|
|
|
ceph_allocate_page_array(mapping, ceph_wbc, folio);
|
2025-02-17 18:51:15 +00:00
|
|
|
} else if (!is_folio_index_contiguous(ceph_wbc, folio)) {
|
ceph: introduce ceph_process_folio_batch() method
The first step of the ceph_writepages_start() logic is
finding the dirty memory folios and processing them.
This patch introduces the ceph_process_folio_batch()
method, which moves this logic into a dedicated method.

ceph_writepages_start() has this logic:

	if (ceph_wbc.locked_pages == 0)
		lock_page(page); /* first page */
	else if (!trylock_page(page))
		break;

	<skipped>

	if (folio_test_writeback(folio) ||
	    folio_test_private_2(folio) /* [DEPRECATED] */) {
		if (wbc->sync_mode == WB_SYNC_NONE) {
			doutc(cl, "%p under writeback\n", folio);
			folio_unlock(folio);
			continue;
		}
		doutc(cl, "waiting on writeback %p\n", folio);
		folio_wait_writeback(folio);
		folio_wait_private_2(folio); /* [DEPRECATED] */
	}

The problem here is that the folio/page is locked first
and is marked as under writeback by set_page_writeback(page)
later, before submitting the write request. The folio/page is
unlocked by writepages_finish() after the write request finishes.
This means that the logic of checking folio_test_writeback()
and calling folio_wait_writeback() never works, because the page
is locked and cannot be locked again until the write request
completes. However, for the majority of folios/pages,
trylock_page() is used. As a result, multiple threads can try to
lock the same folios/pages multiple times even if they are already
under writeback. This makes the logic more compute intensive than
necessary.

This patch changes this logic to:

	if (folio_test_writeback(folio) ||
	    folio_test_private_2(folio) /* [DEPRECATED] */) {
		if (wbc->sync_mode == WB_SYNC_NONE) {
			doutc(cl, "%p under writeback\n", folio);
			folio_unlock(folio);
			continue;
		}
		doutc(cl, "waiting on writeback %p\n", folio);
		folio_wait_writeback(folio);
		folio_wait_private_2(folio); /* [DEPRECATED] */
	}

	if (ceph_wbc.locked_pages == 0)
		lock_page(page); /* first page */
	else if (!trylock_page(page))
		break;

This logic should prevent the writeback state of
folios/pages from being ignored.
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-3-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-04 16:02:47 -08:00
			if (is_num_ops_too_big(ceph_wbc)) {
2025-02-17 18:51:15 +00:00
				folio_redirty_for_writepage(wbc, folio);
				folio_unlock(folio);
2025-02-04 16:02:47 -08:00
				break;
			}

			ceph_wbc->num_ops++;
2025-02-17 18:51:15 +00:00
			ceph_wbc->offset = (u64)folio_pos(folio);
2025-02-04 16:02:47 -08:00
			ceph_wbc->len = 0;
		}

		/* note position of first page in fbatch */
2025-02-17 18:51:15 +00:00
		doutc(cl, "%llx.%llx will write folio %p idx %lu\n",
		      ceph_vinop(inode), folio, folio->index);
2025-02-04 16:02:47 -08:00

		fsc->write_congested = is_write_congestion_happened(fsc);

2025-02-17 18:51:16 +00:00
		rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
				folio);
2025-02-04 16:02:47 -08:00
		if (rc) {
2025-02-17 18:51:15 +00:00
			folio_redirty_for_writepage(wbc, folio);
			folio_unlock(folio);
2025-02-04 16:02:47 -08:00
			break;
		}

		ceph_wbc->fbatch.folios[i] = NULL;
2025-02-17 18:51:15 +00:00
		ceph_wbc->len += folio_size(folio);
2025-02-04 16:02:47 -08:00
	}

	ceph_wbc->processed_in_fbatch = i;

	return rc;
}

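The commit message attached to ceph_process_folio_batch() argues that the writeback state should be tested before the folio lock is attempted. The following is a minimal user-space sketch of that ordering, not the kernel implementation: `struct mock_folio`, `mock_trylock()` and `mock_lock_batch()` are hypothetical names invented for illustration, and the sketch always skips folios under writeback (the WB_SYNC_NONE behaviour) instead of waiting on them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct folio: only the two flags we need. */
struct mock_folio {
	bool writeback;	/* models folio_test_writeback() */
	bool locked;	/* models the folio lock */
};

/* Models a trylock: take the lock only if nobody holds it. */
static bool mock_trylock(struct mock_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

/*
 * Walk a batch and lock the folios that can be written.
 * Returns the number of folios locked by this walk.
 */
static size_t mock_lock_batch(struct mock_folio *batch, size_t count)
{
	size_t i, locked = 0;

	for (i = 0; i < count; i++) {
		struct mock_folio *f = &batch[i];

		/* Check writeback state first, without touching the lock. */
		if (f->writeback)
			continue;

		/*
		 * The kernel code blocks on the lock for the first folio;
		 * this single-threaded mock cannot block, so it stops at
		 * any folio that is already locked.
		 */
		if (!mock_trylock(f))
			break;

		locked++;
	}

	return locked;
}
```

With this ordering, a folio that is already under writeback never costs a lock attempt, which is the overhead the commit message describes.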
2025-02-04 16:02:48 -08:00
static inline
void ceph_shift_unused_folios_left(struct folio_batch *fbatch)
{
	unsigned j, n = 0;

	/* shift unused page to beginning of fbatch */
	for (j = 0; j < folio_batch_count(fbatch); j++) {
		if (!fbatch->folios[j])
			continue;

		if (n < j) {
			fbatch->folios[n] = fbatch->folios[j];
		}

		n++;
	}

	fbatch->nr = n;
}

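The compaction performed by ceph_shift_unused_folios_left() can be tried outside the kernel with a plain pointer array. This is a sketch under the assumption that a NULL slot means "already consumed", as in the function above; `compact_nonnull()` is a hypothetical helper name, not a kernel API.

```c
#include <stddef.h>

/*
 * Move every non-NULL entry to the front of the array, preserving
 * order, and return the number of entries kept.
 */
static size_t compact_nonnull(void **slots, size_t count)
{
	size_t j, n = 0;

	for (j = 0; j < count; j++) {
		if (!slots[j])
			continue;

		/* Only copy when the entry actually has to move. */
		if (n < j)
			slots[n] = slots[j];

		n++;
	}

	return n;
}
```

The entries past the returned count are left stale, which is why the kernel function also updates fbatch->nr so that callers never look at them.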
static
int ceph_submit_write(struct address_space *mapping,
			struct writeback_control *wbc,
			struct ceph_writeback_ctl *ceph_wbc)
{
	struct inode *inode = mapping->host;
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
	struct ceph_client *cl = fsc->client;
	struct ceph_vino vino = ceph_vino(inode);
	struct ceph_osd_request *req = NULL;
	struct page *page = NULL;
	bool caching = ceph_is_cache_enabled(inode);
	u64 offset;
	u64 len;
	unsigned i;

new_request:
	offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
	len = ceph_wbc->wsize;

	req = ceph_osdc_new_request(&fsc->client->osdc,
				    &ci->i_layout, vino,
				    offset, &len, 0, ceph_wbc->num_ops,
				    CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
				    ceph_wbc->snapc, ceph_wbc->truncate_seq,
				    ceph_wbc->truncate_size, false);
	if (IS_ERR(req)) {
		req = ceph_osdc_new_request(&fsc->client->osdc,
					    &ci->i_layout, vino,
					    offset, &len, 0,
					    min(ceph_wbc->num_ops,
						CEPH_OSD_SLAB_OPS),
					    CEPH_OSD_OP_WRITE,
					    CEPH_OSD_FLAG_WRITE,
					    ceph_wbc->snapc,
					    ceph_wbc->truncate_seq,
					    ceph_wbc->truncate_size,
					    true);
		BUG_ON(IS_ERR(req));
	}

	page = ceph_wbc->pages[ceph_wbc->locked_pages - 1];
	BUG_ON(len < ceph_fscrypt_page_offset(page) + thp_size(page) - offset);

	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
		for (i = 0; i < folio_batch_count(&ceph_wbc->fbatch); i++) {
			struct folio *folio = ceph_wbc->fbatch.folios[i];

			if (!folio)
				continue;

			page = &folio->page;
			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
		}

		for (i = 0; i < ceph_wbc->locked_pages; i++) {
			page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);

			if (!page)
				continue;

			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
		}

		ceph_osdc_put_request(req);
		return -EIO;
	}

	req->r_callback = writepages_finish;
	req->r_inode = inode;

	/* Format the osd request message and submit the write */
	len = 0;
	ceph_wbc->data_pages = ceph_wbc->pages;
	ceph_wbc->op_idx = 0;
	for (i = 0; i < ceph_wbc->locked_pages; i++) {
		u64 cur_offset;

		page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
		cur_offset = page_offset(page);

		/*
		 * Discontinuity in page range? Ceph can handle that by just passing
		 * multiple extents in the write op.
		 */
		if (offset + len != cur_offset) {
			/* If it's full, stop here */
			if (ceph_wbc->op_idx + 1 == req->r_num_ops)
				break;

			/* Kick off an fscache write with what we have so far. */
			ceph_fscache_write_to_cache(inode, offset, len, caching);

			/* Start a new extent */
			osd_req_op_extent_dup_last(req, ceph_wbc->op_idx,
						   cur_offset - offset);

			doutc(cl, "got pages at %llu~%llu\n", offset, len);

			osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx,
							 ceph_wbc->data_pages,
							 len, 0,
							 ceph_wbc->from_pool,
							 false);
			osd_req_op_extent_update(req, ceph_wbc->op_idx, len);

			len = 0;
			offset = cur_offset;
			ceph_wbc->data_pages = ceph_wbc->pages + i;
			ceph_wbc->op_idx++;
		}

		set_page_writeback(page);

		if (caching)
			ceph_set_page_fscache(page);

		len += thp_size(page);
	}

	ceph_fscache_write_to_cache(inode, offset, len, caching);

	if (ceph_wbc->size_stable) {
		len = min(len, ceph_wbc->i_size - offset);
	} else if (i == ceph_wbc->locked_pages) {
		/* writepages_finish() clears writeback pages
		 * according to the data length, so make sure
		 * data length covers all locked pages */
		u64 min_len = len + 1 - thp_size(page);
		len = get_writepages_data_length(inode,
						 ceph_wbc->pages[i - 1],
						 offset);
		len = max(len, min_len);
	}

	if (IS_ENCRYPTED(inode))
		len = round_up(len, CEPH_FSCRYPT_BLOCK_SIZE);

	doutc(cl, "got pages at %llu~%llu\n", offset, len);

	if (IS_ENCRYPTED(inode) &&
	    ((offset | len) & ~CEPH_FSCRYPT_BLOCK_MASK)) {
		pr_warn_client(cl,
			"bad encrypted write offset=%lld len=%llu\n",
			offset, len);
	}

	osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx,
					 ceph_wbc->data_pages, len,
					 0, ceph_wbc->from_pool, false);
	osd_req_op_extent_update(req, ceph_wbc->op_idx, len);

	BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);

	ceph_wbc->from_pool = false;
	if (i < ceph_wbc->locked_pages) {
		BUG_ON(ceph_wbc->num_ops <= req->r_num_ops);
		ceph_wbc->num_ops -= req->r_num_ops;
		ceph_wbc->locked_pages -= i;

		/* allocate new pages array for next request */
		ceph_wbc->data_pages = ceph_wbc->pages;
		__ceph_allocate_page_array(ceph_wbc, ceph_wbc->locked_pages);
		memcpy(ceph_wbc->pages, ceph_wbc->data_pages + i,
			ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages));
		memset(ceph_wbc->data_pages + i, 0,
			ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages));
	} else {
		BUG_ON(ceph_wbc->num_ops != req->r_num_ops);
		/* request message now owns the pages array */
		ceph_wbc->pages = NULL;
	}

	req->r_mtime = inode_get_mtime(inode);
	ceph_osdc_start_request(&fsc->client->osdc, req);
	req = NULL;

	wbc->nr_to_write -= i;
	if (ceph_wbc->pages)
		goto new_request;

	return 0;
}

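The central loop of ceph_submit_write() groups the locked pages into contiguous extents and stops once the request has no free op slot left. The sketch below reproduces only that grouping in user space, assuming sorted page offsets and a fixed page size; `struct extent` and `build_extents()` are hypothetical names for illustration and do not correspond to any Ceph or kernel API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one contiguous byte range. */
struct extent {
	uint64_t offset;
	uint64_t len;
};

/*
 * Group page offsets into at most max_extents contiguous extents.
 * Returns how many pages were consumed; *num_extents receives the
 * number of extents written to out[].
 */
static size_t build_extents(const uint64_t *page_offsets, size_t num_pages,
			    uint64_t page_size,
			    struct extent *out, size_t max_extents,
			    size_t *num_extents)
{
	size_t i, idx = 0;
	uint64_t offset, len = 0;

	if (num_pages == 0 || max_extents == 0) {
		*num_extents = 0;
		return 0;
	}

	offset = page_offsets[0];
	for (i = 0; i < num_pages; i++) {
		uint64_t cur = page_offsets[i];

		/* Discontinuity: close the current extent, start a new one. */
		if (offset + len != cur) {
			/* No free slot left: stop before this page. */
			if (idx + 1 == max_extents)
				break;

			out[idx].offset = offset;
			out[idx].len = len;
			idx++;

			offset = cur;
			len = 0;
		}

		len += page_size;
	}

	/* Record the extent that was still open when the loop ended. */
	out[idx].offset = offset;
	out[idx].len = len;
	*num_extents = idx + 1;

	return i;
}
```

When the returned count is smaller than num_pages, the remaining pages are left for a follow-up call, which mirrors how the function above jumps back to new_request for the pages that did not fit.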
static
void ceph_wait_until_current_writes_complete(struct address_space *mapping,
				struct writeback_control *wbc,
				struct ceph_writeback_ctl *ceph_wbc)
{
	struct page *page;
	unsigned i, nr;

	if (wbc->sync_mode != WB_SYNC_NONE &&
	    ceph_wbc->start_index == 0 && /* all dirty pages were checked */
	    !ceph_wbc->head_snapc) {
		ceph_wbc->index = 0;

		while ((ceph_wbc->index <= ceph_wbc->end) &&
			(nr = filemap_get_folios_tag(mapping,
						     &ceph_wbc->index,
						     (pgoff_t)-1,
						     PAGECACHE_TAG_WRITEBACK,
						     &ceph_wbc->fbatch))) {
			for (i = 0; i < nr; i++) {
				page = &ceph_wbc->fbatch.folios[i]->page;
				if (page_snap_context(page) != ceph_wbc->snapc)
					continue;
				wait_on_page_writeback(page);
			}

			folio_batch_release(&ceph_wbc->fbatch);
			cond_resched();
		}
	}
}

2009-10-06 11:31:09 -07:00
/*
 * initiate async writeback
 */
static int ceph_writepages_start(struct address_space *mapping,
				 struct writeback_control *wbc)
{
	struct inode *inode = mapping->host;
2023-06-12 10:50:38 +08:00
	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
2023-06-12 09:04:07 +08:00
	struct ceph_client *cl = fsc->client;
2017-08-30 11:36:06 +08:00
	struct ceph_writeback_ctl ceph_wbc;
2025-02-04 16:02:46 -08:00
	int rc = 0;
2009-10-06 11:31:09 -07:00

2025-02-04 16:02:48 -08:00
	if (wbc->sync_mode == WB_SYNC_NONE && fsc->write_congested)
2022-03-22 14:39:04 -07:00
		return 0;

2023-06-12 09:04:07 +08:00
	doutc(cl, "%llx.%llx (mode=%s)\n", ceph_vinop(inode),
	      wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
	      (wbc->sync_mode == WB_SYNC_ALL ? "ALL" : "HOLD"));
2009-10-06 11:31:09 -07:00

2025-02-04 16:02:47 -08:00
	if (is_forced_umount(mapping)) {
		/* we're in a forced umount, don't write! */
		return -EIO;
2009-10-06 11:31:09 -07:00
	}

2025-02-04 16:02:46 -08:00
	ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc);
2009-10-06 11:31:09 -07:00

2025-02-04 16:02:49 -08:00
	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
		rc = -EIO;
		goto out;
	}

2009-10-06 11:31:09 -07:00
retry:
2025-02-04 16:02:47 -08:00
	rc = ceph_define_writeback_range(mapping, wbc, &ceph_wbc);
	if (rc == -ENODATA) {
2009-10-06 11:31:09 -07:00
		/* hmm, why does writepages get called when there
		   is no dirty data? */
2025-02-04 16:02:47 -08:00
|
|
|
rc = 0;
|
2025-02-04 16:02:49 -08:00
|
|
|
goto dec_osd_stopping_blocker;
|
2009-10-06 11:31:09 -07:00
|
|
|
}
|
2017-09-01 16:53:58 +08:00
|
|
|
|
2023-03-08 10:21:44 +08:00
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
|
2025-02-04 16:02:46 -08:00
|
|
|
tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
|
2023-03-08 10:21:44 +08:00
|
|
|
|
2025-02-04 16:02:47 -08:00
|
|
|
while (!has_writeback_done(&ceph_wbc)) {
|
|
|
|
ceph_wbc.locked_pages = 0;
|
2025-02-04 16:02:46 -08:00
|
|
|
ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
|
|
|
get_more_pages:
|
2025-02-04 16:02:47 -08:00
|
|
|
ceph_folio_batch_reinit(&ceph_wbc);
|
|
|
|
|
2025-02-04 16:02:46 -08:00
|
|
|
ceph_wbc.nr_folios = filemap_get_folios_tag(mapping,
|
|
|
|
&ceph_wbc.index,
|
|
|
|
ceph_wbc.end,
|
|
|
|
ceph_wbc.tag,
|
|
|
|
&ceph_wbc.fbatch);
|
2025-02-04 16:02:47 -08:00
|
|
|
doutc(cl, "pagevec_lookup_range_tag for tag %#x got %d\n",
|
|
|
|
ceph_wbc.tag, ceph_wbc.nr_folios);
|
|
|
|
|
2025-02-04 16:02:46 -08:00
|
|
|
if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages)
|
2009-10-06 11:31:09 -07:00
|
|
|
break;
|
|
|
|
|
2025-02-04 16:02:48 -08:00
|
|
|
process_folio_batch:
|
2025-02-04 16:02:47 -08:00
|
|
|
rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
|
|
|
|
if (rc)
|
|
|
|
goto release_folios;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
|
|
|
/* did we get anything? */
|
2025-02-04 16:02:46 -08:00
|
|
|
if (!ceph_wbc.locked_pages)
|
2023-01-04 13:14:33 -08:00
|
|
|
goto release_folios;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2025-02-04 16:02:48 -08:00
|
|
|
if (ceph_wbc.processed_in_fbatch) {
|
|
|
|
ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
|
|
|
|
|
|
|
|
if (folio_batch_count(&ceph_wbc.fbatch) == 0 &&
|
2025-02-04 16:02:46 -08:00
|
|
|
ceph_wbc.locked_pages < ceph_wbc.max_pages) {
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "reached end fbatch, trying for more\n");
|
2009-10-06 11:31:09 -07:00
|
|
|
goto get_more_pages;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2025-02-04 16:02:48 -08:00
|
|
|
rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
|
|
|
|
if (rc)
|
2023-05-09 16:38:49 +08:00
|
|
|
goto release_folios;
|
2016-01-07 16:00:17 +08:00
|
|
|
|
2025-02-04 16:02:48 -08:00
|
|
|
ceph_wbc.locked_pages = 0;
|
|
|
|
ceph_wbc.strip_unit_end = 0;
|
2013-03-14 14:09:05 -05:00
|
|
|
|
2025-02-04 16:02:48 -08:00
|
|
|
if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
|
|
|
|
ceph_wbc.nr_folios =
|
|
|
|
folio_batch_count(&ceph_wbc.fbatch);
|
|
|
|
goto process_folio_batch;
|
2016-01-07 16:00:17 +08:00
|
|
|
}
|
2013-03-14 14:09:05 -05:00
|
|
|
|
2017-09-01 16:53:58 +08:00
|
|
|
/*
|
|
|
|
* We stop writing back only if we are not doing
|
|
|
|
* integrity sync. In case of integrity sync we have to
|
|
|
|
* keep going until we have written all the pages
|
|
|
|
* we tagged for writeback prior to entering this loop.
|
|
|
|
*/
|
|
|
|
if (wbc->nr_to_write <= 0 && wbc->sync_mode == WB_SYNC_NONE)
|
2025-02-04 16:02:46 -08:00
|
|
|
ceph_wbc.done = true;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2023-01-04 13:14:33 -08:00
|
|
|
release_folios:
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "folio_batch release on %d folios (%p)\n",
|
2025-02-04 16:02:46 -08:00
|
|
|
(int)ceph_wbc.fbatch.nr,
|
|
|
|
ceph_wbc.fbatch.nr ? ceph_wbc.fbatch.folios[0] : NULL);
|
|
|
|
folio_batch_release(&ceph_wbc.fbatch);
|
2009-10-06 11:31:09 -07:00
|
|
|
}
|
|
|
|
|
2025-02-04 16:02:46 -08:00
|
|
|
if (ceph_wbc.should_loop && !ceph_wbc.done) {
|
2009-10-06 11:31:09 -07:00
|
|
|
/* more to do; loop back to beginning of file */
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "looping back to beginning of file\n");
|
2025-02-04 16:02:48 -08:00
|
|
|
/* OK even when start_index == 0 */
|
|
|
|
ceph_wbc.end = ceph_wbc.start_index - 1;
|
2017-09-01 17:03:16 +08:00
|
|
|
|
|
|
|
/* to write dirty pages associated with next snapc,
|
|
|
|
* we need to wait until current writes complete */
|
2025-02-04 16:02:48 -08:00
|
|
|
ceph_wait_until_current_writes_complete(mapping, wbc, &ceph_wbc);
|
2017-09-01 17:03:16 +08:00
|
|
|
|
2025-02-04 16:02:46 -08:00
|
|
|
ceph_wbc.start_index = 0;
|
|
|
|
ceph_wbc.index = 0;
|
2009-10-06 11:31:09 -07:00
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
2025-02-04 16:02:46 -08:00
|
|
|
if (wbc->range_cyclic || (ceph_wbc.range_whole && wbc->nr_to_write > 0))
|
|
|
|
mapping->writeback_index = ceph_wbc.index;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2025-02-04 16:02:49 -08:00
|
|
|
dec_osd_stopping_blocker:
|
|
|
|
ceph_dec_osd_stopping_blocker(fsc->mdsc);
|
|
|
|
|
2009-10-06 11:31:09 -07:00
|
|
|
out:
|
2025-02-04 16:02:46 -08:00
|
|
|
ceph_put_snap_context(ceph_wbc.last_snapc);
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx dend - startone, rc = %d\n", ceph_vinop(inode),
|
|
|
|
rc);
|
2025-02-04 16:02:48 -08:00
|
|
|
|
2009-10-06 11:31:09 -07:00
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See if a given @snapc is either writeable, or already written.
|
|
|
|
*/
|
|
|
|
static int context_is_writeable_or_written(struct inode *inode,
|
|
|
|
struct ceph_snap_context *snapc)
|
|
|
|
{
|
2017-09-02 10:50:48 +08:00
|
|
|
struct ceph_snap_context *oldest = get_oldest_context(inode, NULL, NULL);
|
2010-03-31 22:01:38 -07:00
|
|
|
int ret = !oldest || snapc->seq <= oldest->seq;
|
|
|
|
|
|
|
|
ceph_put_snap_context(oldest);
|
|
|
|
return ret;
|
2009-10-06 11:31:09 -07:00
|
|
|
}
|
|
|
|
|
2020-05-28 13:56:54 -04:00
|
|
|
/**
|
|
|
|
* ceph_find_incompatible - find an incompatible context and return it
|
2025-02-17 18:51:11 +00:00
|
|
|
* @folio: folio being dirtied
|
2010-03-19 13:27:53 -07:00
|
|
|
*
|
2025-02-17 18:51:11 +00:00
|
|
|
* We are only allowed to write into/dirty a folio if the folio is
|
2020-05-28 13:56:54 -04:00
|
|
|
* clean, or already dirty within the same snap context. Returns a
|
|
|
|
* conflicting context if there is one, NULL if there isn't, or a
|
|
|
|
* negative error code on other errors.
|
|
|
|
*
|
2025-02-17 18:51:11 +00:00
|
|
|
* Must be called with folio lock held.
|
2009-10-06 11:31:09 -07:00
|
|
|
*/
|
2020-05-28 13:56:54 -04:00
|
|
|
static struct ceph_snap_context *
|
2025-02-17 18:51:11 +00:00
|
|
|
ceph_find_incompatible(struct folio *folio)
|
2009-10-06 11:31:09 -07:00
|
|
|
{
|
2025-02-17 18:51:11 +00:00
|
|
|
struct inode *inode = folio->mapping->host;
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = ceph_inode_to_client(inode);
|
2009-10-06 11:31:09 -07:00
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
|
|
|
|
2021-08-31 13:39:13 -04:00
|
|
|
if (ceph_inode_is_shutdown(inode)) {
|
2025-02-17 18:51:11 +00:00
|
|
|
doutc(cl, " %llx.%llx folio %p is shutdown\n",
|
|
|
|
ceph_vinop(inode), folio);
|
2021-08-31 13:39:13 -04:00
|
|
|
return ERR_PTR(-ESTALE);
|
2016-04-15 13:56:12 +08:00
|
|
|
}
|
|
|
|
|
2020-05-28 13:56:54 -04:00
|
|
|
for (;;) {
|
|
|
|
struct ceph_snap_context *snapc, *oldest;
|
|
|
|
|
2025-02-17 18:51:11 +00:00
|
|
|
folio_wait_writeback(folio);
|
2020-05-28 13:56:54 -04:00
|
|
|
|
2025-02-17 18:51:11 +00:00
|
|
|
snapc = page_snap_context(&folio->page);
|
2020-05-28 13:56:54 -04:00
|
|
|
if (!snapc || snapc == ci->i_head_snapc)
|
|
|
|
break;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
|
|
|
/*
|
2025-02-17 18:51:11 +00:00
|
|
|
* this folio is already dirty in another (older) snap
|
2009-10-06 11:31:09 -07:00
|
|
|
* context! is it writeable now?
|
|
|
|
*/
|
2017-09-02 10:50:48 +08:00
|
|
|
oldest = get_oldest_context(inode, NULL, NULL);
|
2010-03-31 21:52:10 -07:00
|
|
|
if (snapc->seq > oldest->seq) {
|
2020-05-28 13:56:54 -04:00
|
|
|
/* not writeable -- return it for the caller to deal with */
|
2010-03-31 22:01:38 -07:00
|
|
|
ceph_put_snap_context(oldest);
|
2025-02-17 18:51:11 +00:00
|
|
|
doutc(cl, " %llx.%llx folio %p snapc %p not current or oldest\n",
|
|
|
|
ceph_vinop(inode), folio, snapc);
|
2020-05-28 13:56:54 -04:00
|
|
|
return ceph_get_snap_context(snapc);
|
2009-10-06 11:31:09 -07:00
|
|
|
}
|
2010-03-31 22:01:38 -07:00
|
|
|
ceph_put_snap_context(oldest);
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2025-02-17 18:51:11 +00:00
|
|
|
/* yay, writeable, do it now (without dropping folio lock) */
|
|
|
|
doutc(cl, " %llx.%llx folio %p snapc %p not current, but oldest\n",
|
|
|
|
ceph_vinop(inode), folio, snapc);
|
|
|
|
if (folio_clear_dirty_for_io(folio)) {
|
2025-02-17 18:51:13 +00:00
|
|
|
int r = write_folio_nounlock(folio, NULL);
|
2020-05-28 13:56:54 -04:00
|
|
|
if (r < 0)
|
|
|
|
return ERR_PTR(r);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2020-06-05 10:43:21 -04:00
|
|
|
static int ceph_netfs_check_write_begin(struct file *file, loff_t pos, unsigned int len,
|
2022-07-11 12:11:21 +08:00
|
|
|
struct folio **foliop, void **_fsdata)
|
2020-06-05 10:43:21 -04:00
|
|
|
{
|
|
|
|
struct inode *inode = file_inode(file);
|
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
|
|
|
struct ceph_snap_context *snapc;
|
|
|
|
|
2025-02-17 18:51:11 +00:00
|
|
|
snapc = ceph_find_incompatible(*foliop);
|
2020-06-05 10:43:21 -04:00
|
|
|
if (snapc) {
|
|
|
|
int r;
|
|
|
|
|
2022-07-11 12:11:21 +08:00
|
|
|
folio_unlock(*foliop);
|
|
|
|
folio_put(*foliop);
|
|
|
|
*foliop = NULL;
|
2020-06-05 10:43:21 -04:00
|
|
|
if (IS_ERR(snapc))
|
|
|
|
return PTR_ERR(snapc);
|
|
|
|
|
|
|
|
ceph_queue_writeback(inode);
|
|
|
|
r = wait_event_killable(ci->i_cap_wq,
|
|
|
|
context_is_writeable_or_written(inode, snapc));
|
|
|
|
ceph_put_snap_context(snapc);
|
|
|
|
return r == 0 ? -EAGAIN : r;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-05-28 13:56:54 -04:00
|
|
|
/*
|
|
|
|
* We are only allowed to write into/dirty the page if the page is
|
|
|
|
* clean, or already dirty within the same snap context.
|
|
|
|
*/
|
2025-07-16 09:36:06 +00:00
|
|
|
static int ceph_write_begin(const struct kiocb *iocb,
|
|
|
|
struct address_space *mapping,
|
2022-02-22 14:31:43 -05:00
|
|
|
loff_t pos, unsigned len,
|
2024-07-15 14:24:01 -04:00
|
|
|
struct folio **foliop, void **fsdata)
|
2020-05-28 13:56:54 -04:00
|
|
|
{
|
2025-07-16 09:36:06 +00:00
|
|
|
struct file *file = iocb->ki_filp;
|
2020-05-28 13:56:54 -04:00
|
|
|
struct inode *inode = file_inode(file);
|
2022-06-09 15:04:01 -07:00
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
2020-06-05 10:43:21 -04:00
|
|
|
int r;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2024-07-15 14:24:01 -04:00
|
|
|
r = netfs_write_begin(&ci->netfs, file, inode->i_mapping, pos, len, foliop, NULL);
|
2022-07-05 10:40:23 +08:00
|
|
|
if (r < 0)
|
|
|
|
return r;
|
|
|
|
|
2024-07-15 14:24:01 -04:00
|
|
|
folio_wait_private_2(*foliop); /* [DEPRECATED] */
|
|
|
|
WARN_ON_ONCE(!folio_test_locked(*foliop));
|
2022-07-05 10:40:23 +08:00
|
|
|
return 0;
|
2010-02-09 11:02:51 -08:00
|
|
|
}
|
|
|
|
|
2009-10-06 11:31:09 -07:00
|
|
|
/*
|
|
|
|
* we don't do anything in here that simple_write_end doesn't do
|
2015-04-30 14:40:54 +08:00
|
|
|
* except adjust dirty page accounting
|
2009-10-06 11:31:09 -07:00
|
|
|
*/
|
2025-07-16 09:36:06 +00:00
|
|
|
static int ceph_write_end(const struct kiocb *iocb,
|
|
|
|
struct address_space *mapping, loff_t pos,
|
|
|
|
unsigned len, unsigned copied,
|
2024-07-10 15:45:32 -04:00
|
|
|
struct folio *folio, void *fsdata)
|
2009-10-06 11:31:09 -07:00
|
|
|
{
|
2025-07-16 09:36:06 +00:00
|
|
|
struct file *file = iocb->ki_filp;
|
2013-01-23 17:07:38 -05:00
|
|
|
struct inode *inode = file_inode(file);
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = ceph_inode_to_client(inode);
|
2017-05-22 12:03:32 +08:00
|
|
|
bool check_cap = false;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx file %p folio %p %d~%d (%d)\n", ceph_vinop(inode),
|
|
|
|
file, folio, (int)pos, (int)copied, (int)len);
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2021-08-11 09:49:13 +01:00
|
|
|
if (!folio_test_uptodate(folio)) {
|
2021-06-14 07:15:38 -04:00
|
|
|
/* just return that nothing was copied on a short copy */
|
2016-09-05 22:20:03 -04:00
|
|
|
if (copied < len) {
|
|
|
|
copied = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
2021-08-11 09:49:13 +01:00
|
|
|
folio_mark_uptodate(folio);
|
2016-09-05 22:20:03 -04:00
|
|
|
}
|
2009-10-06 11:31:09 -07:00
|
|
|
|
|
|
|
/* did file size increase? */
|
2015-12-30 11:32:46 +08:00
|
|
|
if (pos+copied > i_size_read(inode))
|
2009-10-06 11:31:09 -07:00
|
|
|
check_cap = ceph_inode_set_size(inode, pos+copied);
|
|
|
|
|
2021-08-11 09:49:13 +01:00
|
|
|
folio_mark_dirty(folio);
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2016-09-05 22:20:03 -04:00
|
|
|
out:
|
2021-08-11 09:49:13 +01:00
|
|
|
folio_unlock(folio);
|
|
|
|
folio_put(folio);
|
2009-10-06 11:31:09 -07:00
|
|
|
|
|
|
|
if (check_cap)
|
2022-10-18 09:03:29 +08:00
|
|
|
ceph_check_caps(ceph_inode(inode), CHECK_CAPS_AUTHONLY);
|
2009-10-06 11:31:09 -07:00
|
|
|
|
|
|
|
return copied;
|
|
|
|
}
|
|
|
|
|
|
|
|
const struct address_space_operations ceph_aops = {
|
2022-04-29 08:49:28 -04:00
|
|
|
.read_folio = netfs_read_folio,
|
2021-06-29 22:37:05 +01:00
|
|
|
.readahead = netfs_readahead,
|
2009-10-06 11:31:09 -07:00
|
|
|
.writepages = ceph_writepages_start,
|
|
|
|
.write_begin = ceph_write_begin,
|
|
|
|
.write_end = ceph_write_end,
|
2022-02-09 20:22:01 +00:00
|
|
|
.dirty_folio = ceph_dirty_folio,
|
2022-02-09 20:21:40 +00:00
|
|
|
.invalidate_folio = ceph_invalidate_folio,
|
2021-08-20 17:08:30 +01:00
|
|
|
.release_folio = netfs_release_folio,
|
2021-09-23 07:50:08 -04:00
|
|
|
.direct_IO = noop_direct_IO,
|
2025-02-17 18:51:09 +00:00
|
|
|
.migrate_folio = filemap_migrate_folio,
|
2009-10-06 11:31:09 -07:00
|
|
|
};
|
|
|
|
|
2016-05-10 18:40:28 +08:00
|
|
|
static void ceph_block_sigs(sigset_t *oldset)
|
|
|
|
{
|
|
|
|
sigset_t mask;
|
|
|
|
siginitsetinv(&mask, sigmask(SIGKILL));
|
|
|
|
sigprocmask(SIG_BLOCK, &mask, oldset);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void ceph_restore_sigs(sigset_t *oldset)
|
|
|
|
{
|
|
|
|
sigprocmask(SIG_SETMASK, oldset, NULL);
|
|
|
|
}
|
2009-10-06 11:31:09 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* vm ops
|
|
|
|
*/
|
2018-07-23 21:32:24 +05:30
|
|
|
static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf)
|
2013-11-28 14:28:14 +08:00
|
|
|
{
|
2017-02-24 14:56:41 -08:00
|
|
|
struct vm_area_struct *vma = vmf->vma;
|
2013-11-28 14:28:14 +08:00
|
|
|
struct inode *inode = file_inode(vma->vm_file);
|
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = ceph_inode_to_client(inode);
|
2013-11-28 14:28:14 +08:00
|
|
|
struct ceph_file_info *fi = vma->vm_file->private_data;
|
2020-10-04 19:04:24 +01:00
|
|
|
loff_t off = (loff_t)vmf->pgoff << PAGE_SHIFT;
|
2018-07-23 21:32:24 +05:30
|
|
|
int want, got, err;
|
2016-05-10 18:40:28 +08:00
|
|
|
sigset_t oldset;
|
2018-07-23 21:32:24 +05:30
|
|
|
vm_fault_t ret = VM_FAULT_SIGBUS;
|
2016-05-10 18:40:28 +08:00
|
|
|
|
2021-08-31 13:39:13 -04:00
|
|
|
if (ceph_inode_is_shutdown(inode))
|
|
|
|
return ret;
|
|
|
|
|
2016-05-10 18:40:28 +08:00
|
|
|
ceph_block_sigs(&oldset);
|
2013-11-28 14:28:14 +08:00
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx %llu trying to get caps\n",
|
|
|
|
ceph_vinop(inode), off);
|
2013-11-28 14:28:14 +08:00
|
|
|
if (fi->fmode & CEPH_FILE_MODE_LAZY)
|
|
|
|
want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
|
|
|
|
else
|
|
|
|
want = CEPH_CAP_FILE_CACHE;
|
2016-05-10 18:40:28 +08:00
|
|
|
|
|
|
|
got = 0;
|
2021-04-05 12:19:35 -04:00
|
|
|
err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_RD, want, -1, &got);
|
2018-07-23 21:32:24 +05:30
|
|
|
if (err < 0)
|
2016-05-10 18:40:28 +08:00
|
|
|
goto out_restore;
|
2016-05-10 18:59:13 +08:00
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx %llu got cap refs on %s\n", ceph_vinop(inode),
|
|
|
|
off, ceph_cap_string(got));
|
2013-11-28 14:28:14 +08:00
|
|
|
|
2014-11-14 22:36:18 +08:00
|
|
|
if ((got & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO)) ||
|
2022-06-07 10:13:53 +08:00
|
|
|
!ceph_has_inline_data(ci)) {
|
2017-12-15 11:15:36 +08:00
|
|
|
CEPH_DEFINE_RW_CONTEXT(rw_ctx, got);
|
|
|
|
ceph_add_rw_context(fi, &rw_ctx);
|
2017-02-24 14:56:41 -08:00
|
|
|
ret = filemap_fault(vmf);
|
2017-12-15 11:15:36 +08:00
|
|
|
ceph_del_rw_context(fi, &rw_ctx);
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx %llu drop cap refs %s ret %x\n",
|
|
|
|
ceph_vinop(inode), off, ceph_cap_string(got), ret);
|
2016-10-25 10:51:55 +08:00
|
|
|
} else
|
2018-07-23 21:32:24 +05:30
|
|
|
err = -EAGAIN;
|
2013-11-28 14:28:14 +08:00
|
|
|
|
|
|
|
ceph_put_cap_refs(ci, got);
|
|
|
|
|
2018-07-23 21:32:24 +05:30
|
|
|
if (err != -EAGAIN)
|
2016-05-10 18:40:28 +08:00
|
|
|
goto out_restore;
|
2014-11-14 22:36:18 +08:00
|
|
|
|
|
|
|
/* read inline data */
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
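A small userspace sketch may help show why the substitutions listed in the commit message above preserve behavior. It assumes a page shift of 12 and uses hypothetical `SKETCH_*` macros and helper names invented here in place of the real kernel macros (whose values are architecture-specific). The only point is that once the two shifts are equal, shifting by their difference is a no-op and the two SIZE constants coincide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; not the kernel definitions. */
#define SKETCH_PAGE_SHIFT        12
#define SKETCH_PAGE_CACHE_SHIFT  SKETCH_PAGE_SHIFT
#define SKETCH_PAGE_SIZE         (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_CACHE_SIZE   (1UL << SKETCH_PAGE_CACHE_SHIFT)

/* The whole cleanup relies on this equality. */
_Static_assert(SKETCH_PAGE_CACHE_SHIFT == SKETCH_PAGE_SHIFT,
	       "page cache chunks are assumed to be exactly one page");

/* Before: <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) */
static uint64_t index_before(uint64_t pgoff)
{
	return pgoff << (SKETCH_PAGE_CACHE_SHIFT - SKETCH_PAGE_SHIFT);
}

/* After: <foo> */
static uint64_t index_after(uint64_t pgoff)
{
	return pgoff;
}

/* Before: byte offset computed with PAGE_CACHE_SHIFT */
static uint64_t offset_before(uint64_t index)
{
	return index << SKETCH_PAGE_CACHE_SHIFT;
}

/* After: the same offset computed with PAGE_SHIFT */
static uint64_t offset_after(uint64_t index)
{
	return index << SKETCH_PAGE_SHIFT;
}
```

Under the stated assumption, each "before" helper returns the same value as its "after" counterpart for every input, which is what makes the mechanical rewrite safe.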
2016-04-01 15:29:47 +03:00
|
|
|
if (off >= PAGE_SIZE) {
|
2014-11-14 22:36:18 +08:00
|
|
|
/* does not support inline data > PAGE_SIZE */
|
|
|
|
ret = VM_FAULT_SIGBUS;
|
|
|
|
} else {
|
|
|
|
struct address_space *mapping = inode->i_mapping;
|
2021-04-22 16:38:26 +02:00
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
filemap_invalidate_lock_shared(mapping);
|
|
|
|
page = find_or_create_page(mapping, 0,
|
|
|
|
mapping_gfp_constraint(mapping, ~__GFP_FS));
|
2014-11-14 22:36:18 +08:00
|
|
|
if (!page) {
|
|
|
|
ret = VM_FAULT_OOM;
|
2016-05-10 18:40:28 +08:00
|
|
|
goto out_inline;
|
2014-11-14 22:36:18 +08:00
|
|
|
}
|
2018-07-23 21:32:24 +05:30
|
|
|
err = __ceph_do_getattr(inode, page,
|
2014-11-14 22:36:18 +08:00
|
|
|
CEPH_STAT_CAP_INLINE_DATA, true);
|
2018-07-23 21:32:24 +05:30
|
|
|
if (err < 0 || off >= i_size_read(inode)) {
|
2014-11-14 22:36:18 +08:00
|
|
|
unlock_page(page);
|
2016-04-01 15:29:47 +03:00
|
|
|
put_page(page);
|
2019-01-05 01:00:29 +05:30
|
|
|
ret = vmf_error(err);
|
2016-05-10 18:40:28 +08:00
|
|
|
goto out_inline;
|
2014-11-14 22:36:18 +08:00
|
|
|
}
|
2018-07-23 21:32:24 +05:30
|
|
|
if (err < PAGE_SIZE)
|
|
|
|
zero_user_segment(page, err, PAGE_SIZE);
|
2014-11-14 22:36:18 +08:00
|
|
|
else
|
|
|
|
flush_dcache_page(page);
|
|
|
|
SetPageUptodate(page);
|
|
|
|
vmf->page = page;
|
|
|
|
ret = VM_FAULT_MAJOR | VM_FAULT_LOCKED;
|
2016-05-10 18:40:28 +08:00
|
|
|
out_inline:
|
2021-04-22 16:38:26 +02:00
|
|
|
filemap_invalidate_unlock_shared(mapping);
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx %llu read inline data ret %x\n",
|
|
|
|
ceph_vinop(inode), off, ret);
|
2014-11-14 22:36:18 +08:00
|
|
|
}
|
2016-05-10 18:40:28 +08:00
|
|
|
out_restore:
|
|
|
|
ceph_restore_sigs(&oldset);
|
2018-07-23 21:32:24 +05:30
|
|
|
if (err < 0)
|
|
|
|
ret = vmf_error(err);
|
2016-05-10 18:59:13 +08:00
|
|
|
|
2013-11-28 14:28:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2018-07-23 21:32:24 +05:30
|
|
|
static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
|
2009-10-06 11:31:09 -07:00
|
|
|
{
|
2017-02-24 14:56:41 -08:00
|
|
|
struct vm_area_struct *vma = vmf->vma;
|
2013-01-23 17:07:38 -05:00
|
|
|
struct inode *inode = file_inode(vma->vm_file);
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = ceph_inode_to_client(inode);
|
2013-11-28 14:28:14 +08:00
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
|
|
|
struct ceph_file_info *fi = vma->vm_file->private_data;
|
2015-06-10 17:26:13 +08:00
|
|
|
struct ceph_cap_flush *prealloc_cf;
|
2025-02-17 18:51:10 +00:00
|
|
|
struct folio *folio = page_folio(vmf->page);
|
|
|
|
loff_t off = folio_pos(folio);
|
2013-11-28 14:28:14 +08:00
|
|
|
loff_t size = i_size_read(inode);
|
|
|
|
size_t len;
|
2018-07-23 21:32:24 +05:30
|
|
|
int want, got, err;
|
2016-05-10 18:40:28 +08:00
|
|
|
sigset_t oldset;
|
2018-07-23 21:32:24 +05:30
|
|
|
vm_fault_t ret = VM_FAULT_SIGBUS;
|
2012-06-12 16:20:24 +02:00
|
|
|
|
2021-08-31 13:39:13 -04:00
|
|
|
if (ceph_inode_is_shutdown(inode))
|
|
|
|
return ret;
|
|
|
|
|
2015-06-10 17:26:13 +08:00
|
|
|
prealloc_cf = ceph_alloc_cap_flush();
|
|
|
|
if (!prealloc_cf)
|
2016-05-10 18:59:13 +08:00
|
|
|
return VM_FAULT_OOM;
|
2015-06-10 17:26:13 +08:00
|
|
|
|
2019-08-01 10:06:40 -04:00
|
|
|
sb_start_pagefault(inode->i_sb);
|
2016-05-10 18:40:28 +08:00
|
|
|
ceph_block_sigs(&oldset);
|
2015-06-10 17:26:13 +08:00
|
|
|
|
2025-02-17 18:51:10 +00:00
|
|
|
if (off + folio_size(folio) <= size)
|
|
|
|
len = folio_size(folio);
|
2009-10-06 11:31:09 -07:00
|
|
|
else
|
2025-02-17 18:51:10 +00:00
|
|
|
len = offset_in_folio(folio, size);
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx %llu~%zd getting caps i_size %llu\n",
|
|
|
|
ceph_vinop(inode), off, len, size);
|
2013-11-28 14:28:14 +08:00
|
|
|
if (fi->fmode & CEPH_FILE_MODE_LAZY)
|
|
|
|
want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
|
|
|
|
else
|
|
|
|
want = CEPH_CAP_FILE_BUFFER;
|
2016-05-10 18:40:28 +08:00
|
|
|
|
|
|
|
got = 0;
|
2021-04-05 12:19:35 -04:00
|
|
|
err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
|
2018-07-23 21:32:24 +05:30
|
|
|
if (err < 0)
|
2016-05-10 18:40:28 +08:00
|
|
|
goto out_free;
|
2016-05-10 18:59:13 +08:00
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx %llu~%zd got cap refs on %s\n", ceph_vinop(inode),
|
|
|
|
off, len, ceph_cap_string(got));
|
2013-11-28 14:28:14 +08:00
|
|
|
|
2025-02-17 18:51:10 +00:00
|
|
|
/* Update time before taking folio lock */
|
2013-11-28 14:28:14 +08:00
|
|
|
file_update_time(vma->vm_file);
|
2019-06-06 08:57:27 -04:00
|
|
|
inode_inc_iversion_raw(inode);
|
2010-02-09 11:02:51 -08:00
|
|
|
|
2016-05-10 19:09:06 +08:00
|
|
|
do {
|
2020-05-28 14:59:49 -04:00
|
|
|
struct ceph_snap_context *snapc;
|
|
|
|
|
2025-02-17 18:51:10 +00:00
|
|
|
folio_lock(folio);
|
2010-02-09 11:02:51 -08:00
|
|
|
|
2025-02-17 18:51:10 +00:00
|
|
|
if (folio_mkwrite_check_truncate(folio, inode) < 0) {
|
|
|
|
folio_unlock(folio);
|
2016-05-10 19:09:06 +08:00
|
|
|
ret = VM_FAULT_NOPAGE;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2025-02-17 18:51:11 +00:00
|
|
|
snapc = ceph_find_incompatible(folio);
|
2020-05-28 14:59:49 -04:00
|
|
|
if (!snapc) {
|
2025-02-17 18:51:10 +00:00
|
|
|
/* success. we'll keep the folio locked. */
|
|
|
|
folio_mark_dirty(folio);
|
2016-05-10 19:09:06 +08:00
|
|
|
ret = VM_FAULT_LOCKED;
|
2020-05-28 14:59:49 -04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2025-02-17 18:51:10 +00:00
|
|
|
folio_unlock(folio);
|
2020-05-28 14:59:49 -04:00
|
|
|
|
|
|
|
if (IS_ERR(snapc)) {
|
|
|
|
ret = VM_FAULT_SIGBUS;
|
|
|
|
break;
|
2016-05-10 19:09:06 +08:00
|
|
|
}
|
2020-05-28 14:59:49 -04:00
|
|
|
|
|
|
|
ceph_queue_writeback(inode);
|
|
|
|
err = wait_event_killable(ci->i_cap_wq,
|
|
|
|
context_is_writeable_or_written(inode, snapc));
|
|
|
|
ceph_put_snap_context(snapc);
|
|
|
|
} while (err == 0);
|
2010-02-09 11:02:51 -08:00
|
|
|
|
2021-12-15 23:48:33 +00:00
|
|
|
if (ret == VM_FAULT_LOCKED) {
|
2013-11-28 14:28:14 +08:00
|
|
|
int dirty;
|
|
|
|
spin_lock(&ci->i_ceph_lock);
|
2015-06-10 17:26:13 +08:00
|
|
|
dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
|
|
|
|
&prealloc_cf);
|
2013-11-28 14:28:14 +08:00
|
|
|
spin_unlock(&ci->i_ceph_lock);
|
|
|
|
if (dirty)
|
|
|
|
__mark_inode_dirty(inode, dirty);
|
|
|
|
}
|
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx %llu~%zd dropping cap refs on %s ret %x\n",
|
|
|
|
ceph_vinop(inode), off, len, ceph_cap_string(got), ret);
|
2020-12-10 14:39:26 -05:00
|
|
|
ceph_put_cap_refs_async(ci, got);
|
2015-06-10 17:26:13 +08:00
|
|
|
out_free:
|
2016-05-10 18:40:28 +08:00
|
|
|
ceph_restore_sigs(&oldset);
|
2019-08-01 10:06:40 -04:00
|
|
|
sb_end_pagefault(inode->i_sb);
|
2015-06-10 17:26:13 +08:00
|
|
|
ceph_free_cap_flush(prealloc_cf);
|
2018-07-23 21:32:24 +05:30
|
|
|
if (err < 0)
|
|
|
|
ret = vmf_error(err);
|
2009-10-06 11:31:09 -07:00
|
|
|
return ret;
|
|
|
|
}
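The length computation near the top of ceph_page_mkwrite() above can be sketched in user space. This is only an illustration: `mkwrite_len` is a hypothetical helper, and it assumes the folio size is a power of two so that the offset of i_size within the folio can be taken with a mask.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the branch above:
 * - if the whole folio lies below i_size, the full folio is writable;
 * - otherwise only the part of the folio below i_size counts.
 * Assumes folio_size is a power of two.
 */
static uint64_t mkwrite_len(uint64_t off, uint64_t folio_size, uint64_t i_size)
{
	if (off + folio_size <= i_size)
		return folio_size;

	/* Offset of i_size inside the folio that contains it. */
	return i_size & (folio_size - 1);
}
```

For a 4096-byte folio at offset 8192 in a 10000-byte file, the sketch yields 1808, the number of bytes between the folio start and end of file.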
|
|
|
|
|
2014-11-14 21:41:55 +08:00
|
|
|
void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
|
|
|
|
char *data, size_t len)
|
|
|
|
{
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = ceph_inode_to_client(inode);
|
2014-11-14 21:41:55 +08:00
|
|
|
struct address_space *mapping = inode->i_mapping;
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
if (locked_page) {
|
|
|
|
page = locked_page;
|
|
|
|
} else {
|
|
|
|
if (i_size_read(inode) == 0)
|
|
|
|
return;
|
|
|
|
page = find_or_create_page(mapping, 0,
|
2015-11-06 16:28:49 -08:00
|
|
|
mapping_gfp_constraint(mapping,
|
|
|
|
~__GFP_FS));
|
2014-11-14 21:41:55 +08:00
|
|
|
if (!page)
|
|
|
|
return;
|
|
|
|
if (PageUptodate(page)) {
|
|
|
|
unlock_page(page);
|
2016-04-01 15:29:47 +03:00
|
|
|
put_page(page);
|
2014-11-14 21:41:55 +08:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%p %llx.%llx len %zu locked_page %p\n", inode,
|
|
|
|
ceph_vinop(inode), len, locked_page);
|
2014-11-14 21:41:55 +08:00
|
|
|
|
|
|
|
if (len > 0) {
|
|
|
|
void *kaddr = kmap_atomic(page);
|
|
|
|
memcpy(kaddr, data, len);
|
|
|
|
kunmap_atomic(kaddr);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (page != locked_page) {
|
2016-04-01 15:29:47 +03:00
|
|
|
if (len < PAGE_SIZE)
|
|
|
|
zero_user_segment(page, len, PAGE_SIZE);
|
2014-11-14 21:41:55 +08:00
|
|
|
else
|
|
|
|
flush_dcache_page(page);
|
|
|
|
|
|
|
|
SetPageUptodate(page);
|
|
|
|
unlock_page(page);
|
2016-04-01 15:29:47 +03:00
|
|
|
put_page(page);
|
2014-11-14 21:41:55 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-12-15 23:48:33 +00:00
|
|
|
int ceph_uninline_data(struct file *file)
|
2014-11-14 22:38:29 +08:00
|
|
|
{
|
2021-12-15 23:48:33 +00:00
|
|
|
struct inode *inode = file_inode(file);
|
2014-11-14 22:38:29 +08:00
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
2023-06-12 10:50:38 +08:00
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = fsc->client;
|
ceph: fix possible deadlock when holding Fwb to get inline_data
1, mount with wsync.
2, create a file with O_RDWR, and the request was sent to mds.0:
ceph_atomic_open()-->
ceph_mdsc_do_request(openc)
finish_open(file, dentry, ceph_open)-->
ceph_open()-->
ceph_init_file()-->
ceph_init_file_info()-->
ceph_uninline_data()-->
{
...
if (inline_version == 1 || /* initial version, no data */
inline_version == CEPH_INLINE_NONE)
goto out_unlock;
...
}
The inline_version will be 1, which is the initial version for the
newly created file. Here ci->i_inline_version stays at 1, which is
the bug.
3, buffer write to the file immediately:
ceph_write_iter()-->
ceph_get_caps(file, need=Fw, want=Fb, ...);
generic_perform_write()-->
a_ops->write_begin()-->
ceph_write_begin()-->
netfs_write_begin()-->
netfs_begin_read()-->
netfs_rreq_submit_slice()-->
netfs_read_from_server()-->
rreq->netfs_ops->issue_read()-->
ceph_netfs_issue_read()-->
{
...
if (ci->i_inline_version != CEPH_INLINE_NONE &&
ceph_netfs_issue_op_inline(subreq))
return;
...
}
ceph_put_cap_refs(ci, Fwb);
The ceph_netfs_issue_op_inline() will send a getattr(Fsr) request to
mds.1.
4, then mds.1 will request the rd lock for CInode::filelock from
the auth mds.0. mds.0 will do the CInode::filelock state transition
from excl --> sync, but it needs to revoke the Fxwb caps from the
clients.
Meanwhile the kernel client already holds the Fwb caps and is waiting
for the getattr(Fsr) reply.
That is a deadlock!
URL: https://tracker.ceph.com/issues/55377
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-25 16:08:24 +08:00
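The behavioral change described in the commit message above can be sketched as a small decision function. This is a user-space model, not the kernel code: the enum, the `old_action`/`new_action` helpers, and the `INLINE_NONE` stand-in value are assumptions made for illustration. The old logic is taken from the snippet quoted in the message, the new logic from the checks in ceph_uninline_data().

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for CEPH_INLINE_NONE; the exact value is an assumption here. */
#define INLINE_NONE UINT64_MAX

enum uninline_action {
	UNINLINE_NOTHING,        /* leave the inode state untouched */
	UNINLINE_MARK_ONLY,      /* no data to write, but record "not inline" */
	UNINLINE_WRITE_AND_MARK, /* write inline data out, then record it */
};

/* Old behavior quoted in the message: version 1 was skipped entirely. */
static enum uninline_action old_action(uint64_t inline_version)
{
	if (inline_version == 1 || inline_version == INLINE_NONE)
		return UNINLINE_NOTHING;
	return UNINLINE_WRITE_AND_MARK;
}

/* New behavior: version 1 still clears the inline state. */
static enum uninline_action new_action(uint64_t inline_version)
{
	if (inline_version == INLINE_NONE)
		return UNINLINE_NOTHING;
	if (inline_version == 1) /* initial version, no data */
		return UNINLINE_MARK_ONLY;
	return UNINLINE_WRITE_AND_MARK;
}
```

The only input on which the two functions differ is 1: the fix makes a freshly created file lose its inline marker at open time, so a later buffered write no longer takes the inline read path while holding Fwb.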
|
|
|
struct ceph_osd_request *req = NULL;
|
2023-02-01 09:36:45 +08:00
|
|
|
struct ceph_cap_flush *prealloc_cf = NULL;
|
2021-12-15 23:48:33 +00:00
|
|
|
struct folio *folio = NULL;
|
2022-03-07 17:21:21 +03:00
|
|
|
u64 inline_version = CEPH_INLINE_NONE;
|
2021-12-15 23:48:33 +00:00
|
|
|
struct page *pages[1];
|
2014-11-14 22:38:29 +08:00
|
|
|
int err = 0;
|
2022-03-07 17:21:21 +03:00
|
|
|
u64 len;
|
2021-12-15 23:48:33 +00:00
|
|
|
|
2022-04-25 16:08:24 +08:00
|
|
|
spin_lock(&ci->i_ceph_lock);
|
|
|
|
inline_version = ci->i_inline_version;
|
|
|
|
spin_unlock(&ci->i_ceph_lock);
|
|
|
|
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx inline_version %llu\n", ceph_vinop(inode),
|
|
|
|
inline_version);
|
2022-04-25 16:08:24 +08:00
|
|
|
|
2023-02-01 09:36:45 +08:00
|
|
|
if (ceph_inode_is_shutdown(inode)) {
|
|
|
|
err = -EIO;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2022-04-25 16:08:24 +08:00
|
|
|
if (inline_version == CEPH_INLINE_NONE)
|
|
|
|
return 0;
|
|
|
|
|
2021-12-15 23:48:33 +00:00
|
|
|
prealloc_cf = ceph_alloc_cap_flush();
|
|
|
|
if (!prealloc_cf)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2022-04-25 16:08:24 +08:00
|
|
|
if (inline_version == 1) /* initial version, no data */
|
|
|
|
goto out_uninline;
|
|
|
|
|
2021-12-15 23:48:33 +00:00
|
|
|
folio = read_mapping_folio(inode->i_mapping, 0, file);
|
|
|
|
if (IS_ERR(folio)) {
|
|
|
|
err = PTR_ERR(folio);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
folio_lock(folio);
|
2014-11-14 22:38:29 +08:00
|
|
|
|
2021-12-15 23:48:33 +00:00
|
|
|
len = i_size_read(inode);
|
|
|
|
if (len > folio_size(folio))
|
|
|
|
len = folio_size(folio);
|
2014-11-14 22:38:29 +08:00
|
|
|
|
|
|
|
req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
|
|
|
|
ceph_vino(inode), 0, &len, 0, 1,
|
2017-02-11 18:48:41 +01:00
|
|
|
CEPH_OSD_OP_CREATE, CEPH_OSD_FLAG_WRITE,
|
2016-02-16 15:00:24 +01:00
|
|
|
NULL, 0, 0, false);
|
2014-11-14 22:38:29 +08:00
|
|
|
if (IS_ERR(req)) {
|
|
|
|
err = PTR_ERR(req);
|
2021-12-15 23:48:33 +00:00
|
|
|
goto out_unlock;
|
2014-11-14 22:38:29 +08:00
|
|
|
}
|
|
|
|
|
2023-10-04 14:52:09 -04:00
|
|
|
req->r_mtime = inode_get_mtime(inode);
|
2022-06-30 16:21:50 -04:00
|
|
|
ceph_osdc_start_request(&fsc->client->osdc, req);
|
|
|
|
err = ceph_osdc_wait_request(&fsc->client->osdc, req);
|
2014-11-14 22:38:29 +08:00
|
|
|
ceph_osdc_put_request(req);
|
|
|
|
if (err < 0)
|
2021-12-15 23:48:33 +00:00
|
|
|
goto out_unlock;
|
2014-11-14 22:38:29 +08:00
|
|
|
|
|
|
|
req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
|
|
|
|
ceph_vino(inode), 0, &len, 1, 3,
|
2017-02-11 18:48:41 +01:00
|
|
|
CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
|
2016-02-16 15:00:24 +01:00
|
|
|
NULL, ci->i_truncate_seq,
|
|
|
|
ci->i_truncate_size, false);
|
2014-11-14 22:38:29 +08:00
|
|
|
if (IS_ERR(req)) {
|
|
|
|
err = PTR_ERR(req);
|
2021-12-15 23:48:33 +00:00
|
|
|
goto out_unlock;
|
2014-11-14 22:38:29 +08:00
|
|
|
}
|
|
|
|
|
2021-12-15 23:48:33 +00:00
|
|
|
pages[0] = folio_page(folio, 0);
|
|
|
|
osd_req_op_extent_osd_data_pages(req, 1, pages, len, 0, false, false);
|
2014-11-14 22:38:29 +08:00
|
|
|
|
2015-04-13 11:25:07 +08:00
|
|
|
{
|
|
|
|
__le64 xattr_buf = cpu_to_le64(inline_version);
|
|
|
|
err = osd_req_op_xattr_init(req, 0, CEPH_OSD_OP_CMPXATTR,
|
|
|
|
"inline_version", &xattr_buf,
|
|
|
|
sizeof(xattr_buf),
|
|
|
|
CEPH_OSD_CMPXATTR_OP_GT,
|
|
|
|
CEPH_OSD_CMPXATTR_MODE_U64);
|
|
|
|
if (err)
|
2021-12-15 23:48:33 +00:00
|
|
|
goto out_put_req;
|
2015-04-13 11:25:07 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
{
|
|
|
|
char xattr_buf[32];
|
|
|
|
int xattr_len = snprintf(xattr_buf, sizeof(xattr_buf),
|
|
|
|
"%llu", inline_version);
|
|
|
|
err = osd_req_op_xattr_init(req, 2, CEPH_OSD_OP_SETXATTR,
|
|
|
|
"inline_version",
|
|
|
|
xattr_buf, xattr_len, 0, 0);
|
|
|
|
if (err)
|
2021-12-15 23:48:33 +00:00
|
|
|
goto out_put_req;
|
2015-04-13 11:25:07 +08:00
|
|
|
}
|
2014-11-14 22:38:29 +08:00
|
|
|
|
2023-10-04 14:52:09 -04:00
|
|
|
req->r_mtime = inode_get_mtime(inode);
|
2022-06-30 16:21:50 -04:00
|
|
|
ceph_osdc_start_request(&fsc->client->osdc, req);
|
|
|
|
err = ceph_osdc_wait_request(&fsc->client->osdc, req);
|
2020-03-19 23:45:01 -04:00
|
|
|
|
2021-03-22 20:28:49 +08:00
|
|
|
ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
|
2021-05-13 09:40:53 +08:00
|
|
|
req->r_end_latency, len, err);
|
2020-03-19 23:45:01 -04:00
|
|
|
|
2022-04-25 16:08:24 +08:00
|
|
|
out_uninline:
|
2021-12-15 23:48:33 +00:00
|
|
|
if (!err) {
|
|
|
|
int dirty;
|
|
|
|
|
|
|
|
/* Set to CEPH_INLINE_NONE and dirty the caps */
|
|
|
|
down_read(&fsc->mdsc->snap_rwsem);
|
|
|
|
spin_lock(&ci->i_ceph_lock);
|
|
|
|
ci->i_inline_version = CEPH_INLINE_NONE;
|
|
|
|
dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, &prealloc_cf);
|
|
|
|
spin_unlock(&ci->i_ceph_lock);
|
|
|
|
up_read(&fsc->mdsc->snap_rwsem);
|
|
|
|
if (dirty)
|
|
|
|
__mark_inode_dirty(inode, dirty);
|
|
|
|
}
|
|
|
|
out_put_req:
|
2014-11-14 22:38:29 +08:00
|
|
|
ceph_osdc_put_request(req);
|
|
|
|
if (err == -ECANCELED)
|
|
|
|
err = 0;
|
2021-12-15 23:48:33 +00:00
|
|
|
out_unlock:
|
2022-04-25 16:08:24 +08:00
|
|
|
if (folio) {
|
|
|
|
folio_unlock(folio);
|
|
|
|
folio_put(folio);
|
|
|
|
}
|
2014-11-14 22:38:29 +08:00
|
|
|
out:
|
2021-12-15 23:48:33 +00:00
|
|
|
ceph_free_cap_flush(prealloc_cf);
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "%llx.%llx inline_version %llu = %d\n",
|
|
|
|
ceph_vinop(inode), inline_version, err);
|
2014-11-14 22:38:29 +08:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2015-09-09 15:39:26 -07:00
|
|
|
static const struct vm_operations_struct ceph_vmops = {
|
2013-11-28 14:28:14 +08:00
|
|
|
.fault = ceph_filemap_fault,
|
2009-10-06 11:31:09 -07:00
|
|
|
.page_mkwrite = ceph_page_mkwrite,
|
|
|
|
};
|
|
|
|
|
fs: replace mmap hook with .mmap_prepare for simple mappings
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap() hook has been deprecated in favour of
f_op->mmap_prepare().
This callback is invoked in the mmap() logic far earlier, so error handling
can be performed more safely without complicated and bug-prone state
unwinding required should an error arise.
This hook also avoids passing a pointer to a not-yet-correctly-established
VMA avoiding any issues with referencing this data structure.
It rather provides a pointer to the new struct vm_area_desc descriptor type
which contains all required state and allows easy setting of required
parameters without any consideration needing to be paid to locking or
reference counts.
Note that nested filesystems like overlayfs are compatible with an
.mmap_prepare() callback since commit bb666b7c2707 ("mm: add mmap_prepare()
compatibility layer for nested file systems").
In this patch we apply this change to file systems with relatively simple
mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2,
orangefs, nilfs2, romfs, ramfs and aio.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16 20:33:29 +01:00
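A minimal user-space sketch of the shape such a hook takes, modelled loosely on the check visible in ceph_mmap_prepare(): the hook receives a descriptor rather than a VMA, can fail early with an error code, and otherwise fills in the operations it wants. Every `mock_*` type and function here is hypothetical and stands in for kernel structures; installing the operations table on success is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures involved. */
struct mock_vm_ops { int unused; };
struct mock_a_ops { int (*read_folio)(void *file, void *folio); };
struct mock_mapping { const struct mock_a_ops *a_ops; };
struct mock_file { struct mock_mapping *f_mapping; };
struct mock_area_desc {
	struct mock_file *file;
	const struct mock_vm_ops *vm_ops;
};

static const struct mock_vm_ops mock_vmops = { 0 };

/* Reject mappings that cannot read folios; otherwise install the ops. */
static int mock_mmap_prepare(struct mock_area_desc *desc)
{
	struct mock_mapping *mapping = desc->file->f_mapping;

	if (!mapping->a_ops->read_folio)
		return -ENOEXEC;

	desc->vm_ops = &mock_vmops; /* assumption: ops are set on success */
	return 0;
}

static int dummy_read_folio(void *file, void *folio)
{
	(void)file;
	(void)folio;
	return 0;
}

/* Build a descriptor with or without read_folio and run the hook on it. */
static int prepare_result(int has_read_folio)
{
	struct mock_a_ops ops = { has_read_folio ? dummy_read_folio : NULL };
	struct mock_mapping mapping = { &ops };
	struct mock_file file = { &mapping };
	struct mock_area_desc desc = { &file, NULL };
	int rc = mock_mmap_prepare(&desc);

	/* The ops table is installed exactly when the hook succeeds. */
	assert((rc == 0) == (desc.vm_ops == &mock_vmops));
	return rc;
}
```

Because the hook only touches the descriptor, the failure path needs no unwinding: nothing has been established yet when `-ENOEXEC` is returned.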
|
|
|
int ceph_mmap_prepare(struct vm_area_desc *desc)
|
2009-10-06 11:31:09 -07:00
|
|
|
{
|
2025-06-16 20:33:29 +01:00
|
|
|
struct address_space *mapping = desc->file->f_mapping;
|
2009-10-06 11:31:09 -07:00
|
|
|
|
2022-04-29 11:53:28 -04:00
|
|
|
if (!mapping->a_ops->read_folio)
|
2009-10-06 11:31:09 -07:00
|
|
|
return -ENOEXEC;
|
2025-06-16 20:33:29 +01:00
|
|
|
desc->vm_ops = &ceph_vmops;
|
2009-10-06 11:31:09 -07:00
|
|
|
return 0;
|
|
|
|
}
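The shape of the converted hook can be exercised outside the kernel with a small userspace sketch. Everything below is an illustrative stand-in: the struct definitions model only the fields that `ceph_mmap_prepare()` touches, and `demo_mmap_prepare`, `demo_read_folio` and `demo_vmops` are hypothetical names, not kernel symbols. It shows the point made in the commit message: the hook reads and writes only a descriptor, so a failure needs no unwinding.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical userspace stand-ins for the kernel types. Only the fields
 * used by the hook are modelled; the real definitions are much larger.
 */
struct vm_operations_struct { int unused; };
struct address_space_operations {
	int (*read_folio)(void *file, void *folio);
};
struct address_space { const struct address_space_operations *a_ops; };
struct file { struct address_space *f_mapping; };
struct vm_area_desc {
	struct file *file;
	const struct vm_operations_struct *vm_ops;
};

/* Placeholder operations table that the hook installs on success. */
const struct vm_operations_struct demo_vmops = { 0 };

/* Stub read_folio so a "mappable" address space can be built for a demo. */
int demo_read_folio(void *file, void *folio)
{
	(void)file;
	(void)folio;
	return 0;
}

/*
 * Same control flow as the hook above: refuse the mapping if the address
 * space cannot read folios, otherwise record the vm_ops in the descriptor.
 * Nothing is allocated or locked, so the error path is a plain return.
 */
int demo_mmap_prepare(struct vm_area_desc *desc)
{
	struct address_space *mapping = desc->file->f_mapping;

	if (!mapping->a_ops->read_folio)
		return -ENOEXEC;
	desc->vm_ops = &demo_vmops;
	return 0;
}
```

On the error path the descriptor is left untouched, which is what makes the earlier invocation point safe: the caller can simply abandon the descriptor.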
|
2015-04-27 15:33:28 +08:00
|
|
|
|
|
|
|
enum {
|
|
|
|
POOL_READ = 1,
|
|
|
|
POOL_WRITE = 2,
|
|
|
|
};
|
|
|
|
|
2016-03-07 09:35:06 +08:00
|
|
|
static int __ceph_pool_perm_get(struct ceph_inode_info *ci,
|
|
|
|
s64 pool, struct ceph_string *pool_ns)
|
2015-04-27 15:33:28 +08:00
|
|
|
{
|
2023-06-12 10:50:38 +08:00
|
|
|
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(&ci->netfs.inode);
|
2015-04-27 15:33:28 +08:00
|
|
|
struct ceph_mds_client *mdsc = fsc->mdsc;
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = fsc->client;
|
2015-04-27 15:33:28 +08:00
|
|
|
struct ceph_osd_request *rd_req = NULL, *wr_req = NULL;
|
|
|
|
struct rb_node **p, *parent;
|
|
|
|
struct ceph_pool_perm *perm;
|
|
|
|
struct page **pages;
|
2016-03-07 09:35:06 +08:00
|
|
|
size_t pool_ns_len;
|
2015-04-27 15:33:28 +08:00
|
|
|
int err = 0, err2 = 0, have = 0;
|
|
|
|
|
|
|
|
down_read(&mdsc->pool_perm_rwsem);
|
|
|
|
p = &mdsc->pool_perm_tree.rb_node;
|
|
|
|
while (*p) {
|
|
|
|
perm = rb_entry(*p, struct ceph_pool_perm, node);
|
|
|
|
if (pool < perm->pool)
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
else if (pool > perm->pool)
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
else {
|
2016-03-07 09:35:06 +08:00
|
|
|
int ret = ceph_compare_string(pool_ns,
|
|
|
|
perm->pool_ns,
|
|
|
|
perm->pool_ns_len);
|
|
|
|
if (ret < 0)
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
else if (ret > 0)
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
else {
|
|
|
|
have = perm->perm;
|
|
|
|
break;
|
|
|
|
}
|
2015-04-27 15:33:28 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
up_read(&mdsc->pool_perm_rwsem);
|
|
|
|
if (*p)
|
|
|
|
goto out;
|
|
|
|
|
2016-03-07 09:35:06 +08:00
|
|
|
if (pool_ns)
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "pool %lld ns %.*s no perm cached\n", pool,
|
|
|
|
(int)pool_ns->len, pool_ns->str);
|
2016-03-07 09:35:06 +08:00
|
|
|
else
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "pool %lld no perm cached\n", pool);
|
2015-04-27 15:33:28 +08:00
|
|
|
|
|
|
|
down_write(&mdsc->pool_perm_rwsem);
|
2016-03-07 09:35:06 +08:00
|
|
|
p = &mdsc->pool_perm_tree.rb_node;
|
2015-04-27 15:33:28 +08:00
|
|
|
parent = NULL;
|
|
|
|
while (*p) {
|
|
|
|
parent = *p;
|
|
|
|
perm = rb_entry(parent, struct ceph_pool_perm, node);
|
|
|
|
if (pool < perm->pool)
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
else if (pool > perm->pool)
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
else {
|
2016-03-07 09:35:06 +08:00
|
|
|
int ret = ceph_compare_string(pool_ns,
|
|
|
|
perm->pool_ns,
|
|
|
|
perm->pool_ns_len);
|
|
|
|
if (ret < 0)
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
else if (ret > 0)
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
else {
|
|
|
|
have = perm->perm;
|
|
|
|
break;
|
|
|
|
}
|
2015-04-27 15:33:28 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
if (*p) {
|
|
|
|
up_write(&mdsc->pool_perm_rwsem);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2016-02-16 15:00:24 +01:00
|
|
|
rd_req = ceph_osdc_alloc_request(&fsc->client->osdc, NULL,
|
2015-04-27 15:33:28 +08:00
|
|
|
1, false, GFP_NOFS);
|
|
|
|
if (!rd_req) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
rd_req->r_flags = CEPH_OSD_FLAG_READ;
|
|
|
|
osd_req_op_init(rd_req, 0, CEPH_OSD_OP_STAT, 0);
|
|
|
|
rd_req->r_base_oloc.pool = pool;
|
2016-03-07 09:35:06 +08:00
|
|
|
if (pool_ns)
|
|
|
|
rd_req->r_base_oloc.pool_ns = ceph_get_string(pool_ns);
|
2016-04-29 19:54:20 +02:00
|
|
|
ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", ci->i_vino.ino);
|
2015-04-27 15:33:28 +08:00
|
|
|
|
2016-04-27 14:15:51 +02:00
|
|
|
err = ceph_osdc_alloc_messages(rd_req, GFP_NOFS);
|
|
|
|
if (err)
|
|
|
|
goto out_unlock;
|
2015-04-27 15:33:28 +08:00
|
|
|
|
2016-02-16 15:00:24 +01:00
|
|
|
wr_req = ceph_osdc_alloc_request(&fsc->client->osdc, NULL,
|
2015-04-27 15:33:28 +08:00
|
|
|
1, false, GFP_NOFS);
|
|
|
|
if (!wr_req) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2017-02-11 18:48:41 +01:00
|
|
|
wr_req->r_flags = CEPH_OSD_FLAG_WRITE;
|
2015-04-27 15:33:28 +08:00
|
|
|
osd_req_op_init(wr_req, 0, CEPH_OSD_OP_CREATE, CEPH_OSD_OP_FLAG_EXCL);
|
2016-04-28 16:07:23 +02:00
|
|
|
ceph_oloc_copy(&wr_req->r_base_oloc, &rd_req->r_base_oloc);
|
2016-04-29 19:54:20 +02:00
|
|
|
ceph_oid_copy(&wr_req->r_base_oid, &rd_req->r_base_oid);
|
2015-04-27 15:33:28 +08:00
|
|
|
|
2016-04-27 14:15:51 +02:00
|
|
|
err = ceph_osdc_alloc_messages(wr_req, GFP_NOFS);
|
|
|
|
if (err)
|
|
|
|
goto out_unlock;
|
2015-04-27 15:33:28 +08:00
|
|
|
|
|
|
|
/* one page should be large enough for STAT data */
|
|
|
|
pages = ceph_alloc_page_vector(1, GFP_KERNEL);
|
|
|
|
if (IS_ERR(pages)) {
|
|
|
|
err = PTR_ERR(pages);
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
osd_req_op_raw_data_in_pages(rd_req, 0, pages, PAGE_SIZE,
|
|
|
|
0, false, true);
|
2022-06-30 16:21:50 -04:00
|
|
|
ceph_osdc_start_request(&fsc->client->osdc, rd_req);
|
2015-04-27 15:33:28 +08:00
|
|
|
|
2023-10-04 14:52:09 -04:00
|
|
|
wr_req->r_mtime = inode_get_mtime(&ci->netfs.inode);
|
2022-06-30 16:21:50 -04:00
|
|
|
ceph_osdc_start_request(&fsc->client->osdc, wr_req);
|
2015-04-27 15:33:28 +08:00
|
|
|
|
2022-06-30 16:21:50 -04:00
|
|
|
err = ceph_osdc_wait_request(&fsc->client->osdc, rd_req);
|
|
|
|
err2 = ceph_osdc_wait_request(&fsc->client->osdc, wr_req);
|
2015-04-27 15:33:28 +08:00
|
|
|
|
|
|
|
if (err >= 0 || err == -ENOENT)
|
|
|
|
have |= POOL_READ;
|
2019-07-25 20:16:47 +08:00
|
|
|
else if (err != -EPERM) {
|
2020-09-14 13:39:19 +02:00
|
|
|
if (err == -EBLOCKLISTED)
|
|
|
|
fsc->blocklisted = true;
|
2015-04-27 15:33:28 +08:00
|
|
|
goto out_unlock;
|
2019-07-25 20:16:47 +08:00
|
|
|
}
|
2015-04-27 15:33:28 +08:00
|
|
|
|
|
|
|
if (err2 == 0 || err2 == -EEXIST)
|
|
|
|
have |= POOL_WRITE;
|
|
|
|
else if (err2 != -EPERM) {
|
2020-09-14 13:39:19 +02:00
|
|
|
if (err2 == -EBLOCKLISTED)
|
|
|
|
fsc->blocklisted = true;
|
2015-04-27 15:33:28 +08:00
|
|
|
err = err2;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2016-03-07 09:35:06 +08:00
|
|
|
pool_ns_len = pool_ns ? pool_ns->len : 0;
|
2024-09-12 17:39:24 +02:00
|
|
|
perm = kmalloc(struct_size(perm, pool_ns, pool_ns_len + 1), GFP_NOFS);
|
2015-04-27 15:33:28 +08:00
|
|
|
if (!perm) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
perm->pool = pool;
|
|
|
|
perm->perm = have;
|
2016-03-07 09:35:06 +08:00
|
|
|
perm->pool_ns_len = pool_ns_len;
|
|
|
|
if (pool_ns_len > 0)
|
|
|
|
memcpy(perm->pool_ns, pool_ns->str, pool_ns_len);
|
|
|
|
perm->pool_ns[pool_ns_len] = 0;
|
|
|
|
|
2015-04-27 15:33:28 +08:00
|
|
|
rb_link_node(&perm->node, parent, p);
|
|
|
|
rb_insert_color(&perm->node, &mdsc->pool_perm_tree);
|
|
|
|
err = 0;
|
|
|
|
out_unlock:
|
|
|
|
up_write(&mdsc->pool_perm_rwsem);
|
|
|
|
|
2016-04-26 15:05:29 +02:00
|
|
|
ceph_osdc_put_request(rd_req);
|
|
|
|
ceph_osdc_put_request(wr_req);
|
2015-04-27 15:33:28 +08:00
|
|
|
out:
|
|
|
|
if (!err)
|
|
|
|
err = have;
|
2016-03-07 09:35:06 +08:00
|
|
|
if (pool_ns)
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "pool %lld ns %.*s result = %d\n", pool,
|
|
|
|
(int)pool_ns->len, pool_ns->str, err);
|
2016-03-07 09:35:06 +08:00
|
|
|
else
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "pool %lld result = %d\n", pool, err);
|
2015-04-27 15:33:28 +08:00
|
|
|
return err;
|
|
|
|
}
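The way `__ceph_pool_perm_get()` turns the two probe results into a permission bitmask can be isolated as a pure function. The following is a minimal sketch, assuming only the error-code rules visible in the function above; `demo_classify_probes` is a hypothetical helper, and it deliberately leaves out the `-EBLOCKLISTED` bookkeeping, the locking and the cache insertion.

```c
#include <errno.h>

enum {
	POOL_READ  = 1,
	POOL_WRITE = 2,
};

/*
 * Hypothetical helper: rd_err is the result of the STAT probe, wr_err the
 * result of the exclusive CREATE probe. Returns a POOL_* bitmask (>= 0) or
 * a negative errno when a probe failed for a reason other than -EPERM.
 */
int demo_classify_probes(int rd_err, int wr_err)
{
	int have = 0;

	/* A missing object still proves that the read was permitted. */
	if (rd_err >= 0 || rd_err == -ENOENT)
		have |= POOL_READ;
	else if (rd_err != -EPERM)
		return rd_err;

	/* An already existing object still proves the write was permitted. */
	if (wr_err == 0 || wr_err == -EEXIST)
		have |= POOL_WRITE;
	else if (wr_err != -EPERM)
		return wr_err;

	return have;
}
```

Only `-EPERM` is treated as a definite "no permission" answer that is safe to cache; any other failure is passed back to the caller without a bitmask.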
|
|
|
|
|
2019-07-25 20:16:43 +08:00
|
|
|
int ceph_pool_perm_check(struct inode *inode, int need)
|
2015-04-27 15:33:28 +08:00
|
|
|
{
|
2023-06-12 09:04:07 +08:00
|
|
|
struct ceph_client *cl = ceph_inode_to_client(inode);
|
2019-07-25 20:16:43 +08:00
|
|
|
struct ceph_inode_info *ci = ceph_inode(inode);
|
2016-03-07 09:35:06 +08:00
|
|
|
struct ceph_string *pool_ns;
|
2019-07-25 20:16:43 +08:00
|
|
|
s64 pool;
|
2015-04-27 15:33:28 +08:00
|
|
|
int ret, flags;
|
|
|
|
|
2021-01-26 11:49:54 -05:00
|
|
|
/* Only need to do this for regular files */
|
|
|
|
if (!S_ISREG(inode->i_mode))
|
|
|
|
return 0;
|
|
|
|
|
2016-12-13 16:03:26 +08:00
|
|
|
if (ci->i_vino.snap != CEPH_NOSNAP) {
|
|
|
|
/*
|
|
|
|
* Pool permission check needs to write to the first object.
|
2024-11-15 16:11:56 +03:00
|
|
|
* But for snapshot, head of the first object may have already
|
2016-12-13 16:03:26 +08:00
|
|
|
* been deleted. Skip check to avoid creating orphan object.
|
|
|
|
*/
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-06-12 10:50:38 +08:00
|
|
|
if (ceph_test_mount_opt(ceph_inode_to_fs_client(inode),
|
2015-04-27 15:33:28 +08:00
|
|
|
NOPOOLPERM))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
spin_lock(&ci->i_ceph_lock);
|
|
|
|
flags = ci->i_ceph_flags;
|
2016-02-03 21:24:49 +08:00
|
|
|
pool = ci->i_layout.pool_id;
|
2015-04-27 15:33:28 +08:00
|
|
|
spin_unlock(&ci->i_ceph_lock);
|
|
|
|
check:
|
|
|
|
if (flags & CEPH_I_POOL_PERM) {
|
|
|
|
if ((need & CEPH_CAP_FILE_RD) && !(flags & CEPH_I_POOL_RD)) {
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "pool %lld no read perm\n", pool);
|
2015-04-27 15:33:28 +08:00
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
if ((need & CEPH_CAP_FILE_WR) && !(flags & CEPH_I_POOL_WR)) {
|
2023-06-12 09:04:07 +08:00
|
|
|
doutc(cl, "pool %lld no write perm\n", pool);
|
2015-04-27 15:33:28 +08:00
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-03-07 09:35:06 +08:00
|
|
|
pool_ns = ceph_try_get_string(ci->i_layout.pool_ns);
|
|
|
|
ret = __ceph_pool_perm_get(ci, pool, pool_ns);
|
|
|
|
ceph_put_string(pool_ns);
|
2015-04-27 15:33:28 +08:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
flags = CEPH_I_POOL_PERM;
|
|
|
|
if (ret & POOL_READ)
|
|
|
|
flags |= CEPH_I_POOL_RD;
|
|
|
|
if (ret & POOL_WRITE)
|
|
|
|
flags |= CEPH_I_POOL_WR;
|
|
|
|
|
|
|
|
spin_lock(&ci->i_ceph_lock);
|
2016-03-07 09:35:06 +08:00
|
|
|
if (pool == ci->i_layout.pool_id &&
|
|
|
|
pool_ns == rcu_dereference_raw(ci->i_layout.pool_ns)) {
|
|
|
|
ci->i_ceph_flags |= flags;
|
2015-04-27 15:33:28 +08:00
|
|
|
} else {
|
2016-02-03 21:24:49 +08:00
|
|
|
pool = ci->i_layout.pool_id;
|
2015-04-27 15:33:28 +08:00
|
|
|
flags = ci->i_ceph_flags;
|
|
|
|
}
|
|
|
|
spin_unlock(&ci->i_ceph_lock);
|
|
|
|
goto check;
|
|
|
|
}
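The fast path under the `check:` label in `ceph_pool_perm_check()` is a lookup in per-inode flag bits. A rough userspace sketch of that decision follows; the `DEMO_*` constants are placeholder values chosen for illustration (they are not the kernel's `CEPH_I_POOL_*` or `CEPH_CAP_FILE_*` definitions), and `demo_check_cached` is a hypothetical name.

```c
#include <errno.h>

/* Placeholder bit values for illustration only, not the kernel's. */
enum {
	DEMO_POOL_PERM = 1,	/* cached answer is valid */
	DEMO_POOL_RD   = 2,	/* pool is readable */
	DEMO_POOL_WR   = 4,	/* pool is writable */
};
enum {
	DEMO_NEED_RD = 1,
	DEMO_NEED_WR = 2,
};

/*
 * Hypothetical helper mirroring the cached branch:
 *   1       nothing cached yet, the caller has to probe the pool
 *   0       every requested access is permitted
 *   -EPERM  a requested access is not permitted
 */
int demo_check_cached(int flags, int need)
{
	if (!(flags & DEMO_POOL_PERM))
		return 1;
	if ((need & DEMO_NEED_RD) && !(flags & DEMO_POOL_RD))
		return -EPERM;
	if ((need & DEMO_NEED_WR) && !(flags & DEMO_POOL_WR))
		return -EPERM;
	return 0;
}
```

In the real function the "nothing cached" case falls through to `__ceph_pool_perm_get()`, stores the resulting flags under `i_ceph_lock` if the layout has not changed in the meantime, and then jumps back to the check.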
|
|
|
|
|
|
|
|
void ceph_pool_perm_destroy(struct ceph_mds_client *mdsc)
|
|
|
|
{
|
|
|
|
struct ceph_pool_perm *perm;
|
|
|
|
struct rb_node *n;
|
|
|
|
|
|
|
|
while (!RB_EMPTY_ROOT(&mdsc->pool_perm_tree)) {
|
|
|
|
n = rb_first(&mdsc->pool_perm_tree);
|
|
|
|
perm = rb_entry(n, struct ceph_pool_perm, node);
|
|
|
|
rb_erase(n, &mdsc->pool_perm_tree);
|
|
|
|
kfree(perm);
|
|
|
|
}
|
|
|
|
}
|