Commit graph

4329 commits

Author SHA1 Message Date
Chuck Lever
b44ffa4c4f NFSD: Block DESTROY_CLIENTID only when there are ongoing async COPY operations
Currently __destroy_client() consults the nfs4_client's async_copies
list to determine whether there are ongoing async COPY operations.
However, NFSD now keeps copy state in that list even when the
async copy has completed, to enable OFFLOAD_STATUS to find the
COPY results for a while after the COPY has completed.

DESTROY_CLIENTID should not be blocked if the client's async_copies
list contains state for only completed copy operations.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
5c41f32147 NFSD: Handle an NFS4ERR_DELAY response to CB_OFFLOAD
RFC 7862 permits callback services to respond to CB_OFFLOAD with
NFS4ERR_DELAY. Currently NFSD drops the CB_OFFLOAD in that case.

To improve the reliability of COPY offload, NFSD should rather send
another CB_OFFLOAD completion notification.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
409d6f52bd NFSD: Free async copy information in nfsd4_cb_offload_release()
RFC 7862 Section 4.8 states:

> A copy offload stateid will be valid until either (A) the client
> or server restarts or (B) the client returns the resource by
> issuing an OFFLOAD_CANCEL operation or the client replies to a
> CB_OFFLOAD operation.

Currently, NFSD purges the metadata for an async COPY operation as
soon as the CB_OFFLOAD callback has been sent. It does not wait even
for the client's CB_OFFLOAD response, as the paragraph above
suggests that it should.

This makes the OFFLOAD_STATUS operation ineffective during the
window between the completion of an asynchronous COPY and the
server's receipt of the corresponding CB_OFFLOAD response. This is
important if, for example, the client responds with NFS4ERR_DELAY,
or the transport is lost before the server receives the response. A
client might use OFFLOAD_STATUS to query the server about the still
pending asynchronous COPY, but NFSD will respond to OFFLOAD_STATUS
as if it had never heard of the presented copy stateid.

This patch starts to address this issue by extending the lifetime of
struct nfsd4_copy at least until the server has seen the client's
CB_OFFLOAD response, or the CB_OFFLOAD has timed out.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
62a8642ba0 NFSD: Fix nfsd4_shutdown_copy()
nfsd4_shutdown_copy() is just this:

	while ((copy = nfsd4_get_copy(clp)) != NULL)
		nfsd4_stop_copy(copy);

nfsd4_get_copy() bumps @copy's reference count, preventing
nfsd4_stop_copy() from releasing @copy.

A while loop like this usually works by removing the first element
of the list, but neither nfsd4_get_copy() nor nfsd4_stop_copy()
alters the async_copies list.

Best I can tell, then, is that nfsd4_shutdown_copy() continues to
loop until other threads manage to remove all the items from this
list. The spinning loop blocks shutdown until these items are gone.

Possibly the reason we haven't seen this issue in the field is
because client_has_state() prevents __destroy_client() from calling
nfsd4_shutdown_copy() if there are any items on this list. In a
subsequent patch I plan to remove that restriction.

Fixes: e0639dc580 ("NFSD introduce async copy feature")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:09 -05:00
Chuck Lever
a4452e661b NFSD: Add a tracepoint to record canceled async COPY operations
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:09 -05:00
Jeff Layton
10c93b5101 nfsd: make nfsd4_session->se_flags a bool
While this holds the flags from the CREATE_SESSION request, nothing
ever consults them. The only flag used is NFS4_SESSION_DEAD. Make it a
simple bool instead.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:09 -05:00
Jeff Layton
53f9ba78e0 nfsd: remove nfsd4_session->se_bchannel
This field is written and is never consulted again. Remove it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:08 -05:00
NeilBrown
6a404f475f nfsd: make use of warning provided by refcount_t
refcount_t, by design, checks for unwanted situations and provides
warnings.  It is rarely useful to have explicit warnings with refcount
usage.

In this case we have an explicit warning if a refcount_t reaches zero
when decremented.  Simply using refcount_dec() will provide a similar
warning and also mark the refcount_t as saturated to avoid any possible
use-after-free.

This patch drops the warning and uses refcount_dec() instead of
refcount_dec_and_test().

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:08 -05:00
NeilBrown
a2c0412c05 nfsd: Don't fail OP_SETCLIENTID when there are too many clients.
Failing OP_SETCLIENTID or OP_EXCHANGE_ID should only happen if there is
memory allocation failure.  Putting a hard limit on the number of
clients is not really helpful as it will either happen too early and
prevent clients that the server can easily handle, or too late and
allow clients when the server is swamped.

The calculated limit is still useful for expiring courtesy clients where
there are "too many" clients, but it shouldn't prevent the creation of
active clients.

Testing of lots of clients against small-mem servers reports repeated
NFS4ERR_DELAY responses which doesn't seem helpful.  There may have been
reports of similar problems in production use.

Also remove an outdated comment - we do use a slab cache.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:07 -05:00
Yang Erkun
f8c989a0c8 nfsd: release svc_expkey/svc_export with rcu_work
The last reference for `cache_head` can be reduced to zero in `c_show`
and `e_show`(using `rcu_read_lock` and `rcu_read_unlock`). Consequently,
`svc_export_put` and `expkey_put` will be invoked, leading to two
issues:

1. The `svc_export_put` will directly free ex_uuid. However,
   `e_show`/`c_show` will access `ex_uuid` after `cache_put`, which can
   trigger a use-after-free issue, shown below.

   ==================================================================
   BUG: KASAN: slab-use-after-free in svc_export_show+0x362/0x430 [nfsd]
   Read of size 1 at addr ff11000010fdc120 by task cat/870

   CPU: 1 UID: 0 PID: 870 Comm: cat Not tainted 6.12.0-rc3+ #1
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
   1.16.1-2.fc37 04/01/2014
   Call Trace:
    <TASK>
    dump_stack_lvl+0x53/0x70
    print_address_description.constprop.0+0x2c/0x3a0
    print_report+0xb9/0x280
    kasan_report+0xae/0xe0
    svc_export_show+0x362/0x430 [nfsd]
    c_show+0x161/0x390 [sunrpc]
    seq_read_iter+0x589/0x770
    seq_read+0x1e5/0x270
    proc_reg_read+0xe1/0x140
    vfs_read+0x125/0x530
    ksys_read+0xc1/0x160
    do_syscall_64+0x5f/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

   Allocated by task 830:
    kasan_save_stack+0x20/0x40
    kasan_save_track+0x14/0x30
    __kasan_kmalloc+0x8f/0xa0
    __kmalloc_node_track_caller_noprof+0x1bc/0x400
    kmemdup_noprof+0x22/0x50
    svc_export_parse+0x8a9/0xb80 [nfsd]
    cache_do_downcall+0x71/0xa0 [sunrpc]
    cache_write_procfs+0x8e/0xd0 [sunrpc]
    proc_reg_write+0xe1/0x140
    vfs_write+0x1a5/0x6d0
    ksys_write+0xc1/0x160
    do_syscall_64+0x5f/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

   Freed by task 868:
    kasan_save_stack+0x20/0x40
    kasan_save_track+0x14/0x30
    kasan_save_free_info+0x3b/0x60
    __kasan_slab_free+0x37/0x50
    kfree+0xf3/0x3e0
    svc_export_put+0x87/0xb0 [nfsd]
    cache_purge+0x17f/0x1f0 [sunrpc]
    nfsd_destroy_serv+0x226/0x2d0 [nfsd]
    nfsd_svc+0x125/0x1e0 [nfsd]
    write_threads+0x16a/0x2a0 [nfsd]
    nfsctl_transaction_write+0x74/0xa0 [nfsd]
    vfs_write+0x1a5/0x6d0
    ksys_write+0xc1/0x160
    do_syscall_64+0x5f/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

2. We cannot sleep while using `rcu_read_lock`/`rcu_read_unlock`.
   However, `svc_export_put`/`expkey_put` will call path_put, which
   subsequently triggers a sleeping operation due to the following
   `dput`.

   =============================
   WARNING: suspicious RCU usage
   5.10.0-dirty #141 Not tainted
   -----------------------------
   ...
   Call Trace:
   dump_stack+0x9a/0xd0
   ___might_sleep+0x231/0x240
   dput+0x39/0x600
   path_put+0x1b/0x30
   svc_export_put+0x17/0x80
   e_show+0x1c9/0x200
   seq_read_iter+0x63f/0x7c0
   seq_read+0x226/0x2d0
   vfs_read+0x113/0x2c0
   ksys_read+0xc9/0x170
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x67/0xd1

Fix these issues by using `rcu_work` to help release
`svc_expkey`/`svc_export`. This approach allows for an asynchronous
context to invoke `path_put` and also facilitates the freeing of
`uuid/exp/key` after an RCU grace period.

Fixes: 9ceddd9da1 ("knfsd: Allow lockless lookups of the exports")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:05 -05:00
Yang Erkun
be8f982c36 nfsd: make sure exp active before svc_export_show
The function `e_show` was called with protection from RCU. This only
ensures that `exp` will not be freed. Therefore, the reference count for
`exp` can drop to zero, which will trigger a refcount use-after-free
warning when `exp_get` is called. To resolve this issue, use
`cache_get_rcu` to ensure that `exp` remains active.

------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 819 at lib/refcount.c:25
refcount_warn_saturate+0xb1/0x120
CPU: 3 UID: 0 PID: 819 Comm: cat Not tainted 6.12.0-rc3+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
RIP: 0010:refcount_warn_saturate+0xb1/0x120
...
Call Trace:
 <TASK>
 e_show+0x20b/0x230 [nfsd]
 seq_read_iter+0x589/0x770
 seq_read+0x1e5/0x270
 vfs_read+0x125/0x530
 ksys_read+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: bf18f163e8 ("NFSD: Using exp_get for export getting")
Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:05 -05:00
Chuck Lever
f64ea4af43 NFSD: Cap the number of bytes copied by nfs4_reset_recoverydir()
It's only current caller already length-checks the string, but let's
be safe.

Fixes: 0964a3d3f1 ("[PATCH] knfsd: nfsd4 reboot dirname fix")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:02 -05:00
Chuck Lever
30c1d2411a NFSD: Remove unused values from nfsd4_encode_components_esc()
Clean up. The computed value of @p is saved each time through the
loop but is never used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:02 -05:00
Chuck Lever
6b9c1080a6 NFSD: Remove unused results in nfsd4_encode_pathname4()
Clean up. The result of "*p++" is saved, but is not used before it
is overwritten. The result of xdr_encode_opaque() is saved each
time through the loop but is never used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:02 -05:00
Chuck Lever
1e02c641c3 NFSD: Prevent NULL dereference in nfsd4_process_cb_update()
@ses is initialized to NULL. If __nfsd4_find_backchannel() finds no
available backchannel session, setup_callback_client() will try to
dereference @ses and segfault.

Fixes: dcbeaa68db ("nfsd4: allow backchannel recovery")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:01 -05:00
Chuck Lever
da4f777e62 NFSD: Remove a never-true comparison
fh_size is an unsigned int, thus it can never be less than 0.

Fixes: d8b26071e6 ("NFSD: simplify struct nfsfh")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:01 -05:00
Chuck Lever
d08bf5ea64 NFSD: Remove dead code in nfsd4_create_session()
Clean up. AFAICT, there is no way to reach the out_free_conn label
with @old set to a non-NULL value, so the expire_client(old) call
is never reached and can be removed.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:01 -05:00
NeilBrown
4cc9b9f2bf nfsd: refine and rename NFSD_MAY_LOCK
NFSD_MAY_LOCK means a few different things.
- it means that GSS is not required.
- it means that with NFSEXP_NOAUTHNLM, authentication is not required
- it means that OWNER_OVERRIDE is allowed.

None of these are specific to locking, they are specific to the NLM
protocol.
So:
 - rename to NFSD_MAY_NLM
 - set NFSD_MAY_OWNER_OVERRIDE and NFSD_MAY_BYPASS_GSS in nlm_fopen()
   so that NFSD_MAY_NLM doesn't need to imply these.
 - move the test on NFSEXP_NOAUTHNLM out of nfsd_permission() and
   into fh_verify where other special-case tests on the MAY flags
   happen.  nfsd_permission() can be called from other places than
   fh_verify(), but none of these will have NFSD_MAY_NLM.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:00 -05:00
Chuck Lever
6640556b0c NFSD: Replace use of NFSD_MAY_LOCK in nfsd4_lock()
NFSv4 LOCK operations should not avoid the set of authorization
checks that apply to all other NFSv4 operations. Also, the
"no_auth_nlm" export option should apply only to NLM LOCK requests.
It's not necessary or sensible to apply it to NFSv4 LOCK operations.

Instead, set no permission bits when calling fh_verify(). Subsequent
stateid processing handles authorization checks.

Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:00 -05:00
Julia Lawall
ed9887b876 nfsd: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
Since SLOB was removed and since
commit 6c6c47b063 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"),
it is not necessary to use call_rcu when the callback only performs
kmem_cache_free. Use kfree_rcu() directly.

The changes were made using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:00 -05:00
Pali Rohár
bb4f07f240 nfsd: Fix NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT
Currently NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT do not bypass
only GSS, but bypass any method. This is a problem specially for NFS3
AUTH_NULL-only exports.

The purpose of NFSD_MAY_BYPASS_GSS_ON_ROOT is described in RFC 2623,
section 2.3.2, to allow mounting NFS2/3 GSS-only export without
authentication. So few procedures which do not expose security risk used
during mount time can be called also with AUTH_NONE or AUTH_SYS, to allow
client mount operation to finish successfully.

The problem with current implementation is that for AUTH_NULL-only exports,
the NFSD_MAY_BYPASS_GSS_ON_ROOT is active also for NFS3 AUTH_UNIX mount
attempts which confuse NFS3 clients, and make them think that AUTH_UNIX is
enabled and is working. Linux NFS3 client never switches from AUTH_UNIX to
AUTH_NONE on active mount, which makes the mount inaccessible.

Fix the NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT implementation
and really allow to bypass only exports which have enabled some real
authentication (GSS, TLS, or any other).

The result would be: For AUTH_NULL-only export if client attempts to do
mount with AUTH_UNIX flavor then it will receive access errors, which
instruct client that AUTH_UNIX flavor is not usable and will either try
other auth flavor (AUTH_NULL if enabled) or fails mount procedure.
Similarly if client attempt to do mount with AUTH_NULL flavor and only
AUTH_UNIX flavor is enabled then the client will receive access error.

This should fix problems with AUTH_NULL-only or AUTH_UNIX-only exports if
client attempts to mount it with other auth flavor (e.g. with AUTH_NULL for
AUTH_UNIX-only export, or with AUTH_UNIX for AUTH_NULL-only export).

Signed-off-by: Pali Rohár <pali@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:59 -05:00
Pali Rohár
600020927b nfsd: Fill NFSv4.1 server implementation fields in OP_EXCHANGE_ID response
NFSv4.1 OP_EXCHANGE_ID response from server may contain server
implementation details (domain, name and build time) in optional
nfs_impl_id4 field. Currently nfsd does not fill this field.

Send these information in NFSv4.1 OP_EXCHANGE_ID response. Fill them with
the same values as what is Linux NFSv4.1 client doing. Domain is hardcoded
to "kernel.org", name is composed in the same way as "uname -srvm" output
and build time is hardcoded to zeros.

NFSv4.1 client and server implementation fields are useful for statistic
purposes or for identifying type of clients and servers.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:58 -05:00
Jeff Layton
b9376c7e42 nfsd: new tracepoint for after op_func in compound processing
Turn nfsd_compound_encode_err tracepoint into a class and add a new
nfsd_compound_op_err tracepoint.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:58 -05:00
Linus Torvalds
70e7730c2a vfs-6.13.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcToAAKCRCRxhvAZXjc
 osL9AP948FFumJRC28gDJ4xp+X4eohNOfkgoEG8FTbF2zU6ulwD+O0pr26FqpFli
 pqlG+38UdATImpfqqWjPbb72sBYcfQg=
 =wLUh
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Fixup and improve NLM and kNFSD file lock callbacks

     Last year both GFS2 and OCFS2 had some work done to make their
     locking more robust when exported over NFS. Unfortunately, part of
     that work caused both NLM (for NFS v3 exports) and kNFSD (for
     NFSv4.1+ exports) to no longer send lock notifications to clients

     This in itself is not a huge problem because most NFS clients will
     still poll the server in order to acquire a conflicted lock

     It's important for NLM and kNFSD that they do not block their
     kernel threads inside filesystem's file_lock implementations
     because that can produce deadlocks. We used to make sure of this by
     only trusting that posix_lock_file() can correctly handle blocking
     lock calls asynchronously, so the lock managers would only setup
     their file_lock requests for async callbacks if the filesystem did
     not define its own lock() file operation

     However, when GFS2 and OCFS2 grew the capability to correctly
     handle blocking lock requests asynchronously, they started
     signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check
     for also trusting posix_lock_file() was inadvertently dropped, so
     now most filesystems no longer produce lock notifications when
     exported over NFS

     Fix this by using an fop_flag which greatly simplifies the problem
     and grooms the way for future uses by both filesystems and lock
     managers alike

   - Add a sysctl to delete the dentry when a file is removed instead of
     making it a negative dentry

     Commit 681ce86235 ("vfs: Delete the associated dentry when
     deleting a file") introduced an unconditional deletion of the
     associated dentry when a file is removed. However, this led to
     performance regressions in specific benchmarks, such as
     ilebench.sum_operations/s, prompting a revert in commit
     4a4be1ad3a ("Revert "vfs: Delete the associated dentry when
     deleting a file""). This reintroduces the concept conditionally
     through a sysctl

   - Expand the statmount() system call:

       * Report the filesystem subtype in a new fs_subtype field to
         e.g., report fuse filesystem subtypes

       * Report the superblock source in a new sb_source field

       * Add a new way to return filesystem specific mount options in an
         option array that returns filesystem specific mount options
         separated by zero bytes and unescaped. This allows caller's to
         retrieve filesystem specific mount options and immediately pass
         them to e.g., fsconfig() without having to unescape or split
         them

       * Report security (LSM) specific mount options in a separate
         security option array. We don't lump them together with
         filesystem specific mount options as security mount options are
         generic and most users aren't interested in them

         The format is the same as for the filesystem specific mount
         option array

   - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command

   - Optimize acl_permission_check() to avoid costly {g,u}id ownership
     checks if possible

   - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict()

   - Add synchronous wakeup support for ep_poll_callback.

     Currently, epoll only uses wake_up() to wake up task. But sometimes
     there are epoll users which want to use the synchronous wakeup flag
     to give a hint to the scheduler, e.g., the Android binder driver.
     So add a wake_up_sync() define, and use wake_up_sync() when sync is
     true in ep_poll_callback()

  Fixes:

   - Fix kernel documentation for inode_insert5() and iget5_locked()

   - Annotate racy epoll check on file->f_ep

   - Make F_DUPFD_QUERY associative

   - Avoid filename buffer overrun in initramfs

   - Don't let statmount() return empty strings

   - Add a cond_resched() to dump_user_range() to avoid hogging the CPU

   - Don't query the device logical blocksize multiple times for hfsplus

   - Make filemap_read() check that the offset is positive or zero

  Cleanups:

   - Various typo fixes

   - Cleanup wbc_attach_fdatawrite_inode()

   - Add __releases annotation to wbc_attach_and_unlock_inode()

   - Add hugetlbfs tracepoints

   - Fix various vfs kernel doc parameters

   - Remove obsolete TODO comment from io_cancel()

   - Convert wbc_account_cgroup_owner() to take a folio

   - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add()

   - Reorder struct posix_acl to save 8 bytes

   - Annotate struct posix_acl with __counted_by()

   - Replace one-element array with flexible array member in freevxfs

   - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()"

* tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  statmount: retrieve security mount options
  vfs: make evict() use smp_mb__after_spinlock instead of smp_mb
  statmount: add flag to retrieve unescaped options
  fs: add the ability for statmount() to report the sb_source
  writeback: wbc_attach_fdatawrite_inode out of line
  writeback: add a __releases annoation to wbc_attach_and_unlock_inode
  fs: add the ability for statmount() to report the fs_subtype
  fs: don't let statmount return empty strings
  fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel()
  hfsplus: don't query the device logical block size multiple times
  freevxfs: Replace one-element array with flexible array member
  fs: optimize acl_permission_check()
  initramfs: avoid filename buffer overrun
  fs/writeback: convert wbc_account_cgroup_owner to take a folio
  acl: Annotate struct posix_acl with __counted_by()
  acl: Realign struct posix_acl to save 8 bytes
  epoll: Add synchronous wakeup support for ep_poll_callback
  coredump: add cond_resched() to dump_user_range
  mm/page-writeback.c: Fix comment of wb_domain_writeout_add()
  mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL
  ...
2024-11-18 09:35:30 -08:00
Kairui Song
da0c02516c mm/list_lru: simplify the list_lru walk callback function
Now isolation no longer takes the list_lru global node lock, only use the
per-cgroup lock instead.  And this lock is inside the list_lru_one being
walked, no longer needed to pass the lock explicitly.

Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:22:26 -08:00
Jeff Layton
f6259e2e4f nfsd: have nfsd4_deleg_getattr_conflict pass back write deleg pointer
Currently we pass back the size and whether it has been modified, but
those just mirror values tracked inside the delegation. In a later
patch, we'll need to get at the timestamps in the delegation too, so
just pass back a reference to the write delegation, and use that to
properly override values in the iattr.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:07 -05:00
Jeff Layton
3a405432e7 nfsd: drop the nfsd4_fattr_args "size" field
We already have a slot for this in the kstat structure. Just overwrite
that instead of keeping a copy.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:07 -05:00
Jeff Layton
c757ca1a56 nfsd: drop the ncf_cb_bmap field
This is always the same value, and in a later patch we're going to need
to set bits in WORD2. We can simplify this code and save a little space
in the delegation too. Just hardcode the bitmap in the callback encode
function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:07 -05:00
Jeff Layton
f67eef8da0 nfsd: drop inode parameter from nfsd4_change_attribute()
The inode that nfs4_open_delegation() passes to this function is
wrong, which throws off the result. The inode will end up getting a
directory-style change attr instead of a regular-file-style one.

Fix up nfs4_delegation_stat() to fetch STATX_MODE, and then drop the
inode parameter from nfsd4_change_attribute(), since it's no longer
needed.

Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:06 -05:00
Chuck Lever
612196ef5c NFSD: Remove unused function parameter
Clean up: Commit 65294c1f2c ("nfsd: add a new struct file caching
facility to nfsd") moved the fh_verify() call site out of
nfsd_open(). That was the only user of nfsd_open's @rqstp parameter,
so that parameter can be removed.

Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:41:58 -05:00
Thorsten Blum
b7165ab074 NFSD: Remove unnecessary posix_acl_entry pointer initialization
The posix_acl_entry pointer pe is already initialized by the
FOREACH_ACL_ENTRY() macro. Remove the unnecessary initialization.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:41:58 -05:00
Chuck Lever
7f33b92e5b NFSD: Prevent a potential integer overflow
If the tag length is >= U32_MAX - 3 then the "length + 4" addition
can result in an integer overflow. Address this by splitting the
decoding into several steps so that decode_cb_compound4res() does
not have to perform arithmetic on the unsafe length value.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:41:57 -05:00
Linus Torvalds
de2f378f2b nfsd-6.12 fixes:
- Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcvm+cACgkQM2qzM29m
 f5f27xAAwdRav9cbwNfLEIC2cMqS4stZH6DsjJ0VSUa87eJeO0BggqnEGxgK38a6
 jONYrjD5eWdTIuv3P07UF/dAcANT4drr6VgZURukNwnX6fIV0N2pOGH8z3y5Mban
 JByqkGNs2lXegdTn8DLJb8xhU0ZJlLQ01908IKc4E1YecAy8vq2yDeksEIBi3HOB
 y3WViieDdD1rYqh/OGCl/hQ72g+Sq+C6ionl+YsRiE4lfmcWCbnOUSrF5jLbU4Xa
 K8pTMTE1JoLvBwTe4+Dzg48PNVlqq8x386O7Lu+rK8sG+C4LX7I+BX9PlJzWDMBU
 KWN04lNxtxVqNfKN7A3ADw5bIPmcsrc5Lh3FXIomp1deuQ3dFOqsADoVSca4Z3Am
 Sci5MrNqDMG74n6IYkhDMaySpSFs4Gx7Si2G0zgBMRyaw7EdO6b8rx8s5OdPwvFj
 Xv5x+Fya5cXuJ28a8FPrYpKe/ImjNV5A2x/5cJNiOdqY7i7QBcQzu0QfFHzBcMkL
 lL9svfTFzAo9B/jj3VsBr8UudDU14QUibQNtoTpKv9gwWCIJPLsU+KqdCkTBcgrZ
 X5Cv0JSOrw8fOd9GAOksjrFHkc8p6vBIGZC34tmVLLfOSRPyeG93roT5cCMkQAEQ
 XfWvC/epKjGoLRZepA5GG/Y9EjnpiQvUWJvyznWpY5hFYdRfBu0=
 =ltss
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD

* tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
2024-11-09 13:18:07 -08:00
Chuck Lever
bb1fb40f8b NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
I noticed that recently, simple operations like "make" started
failing on NFSv3 mounts of ext4 exports. Network capture shows that
READDIRPLUS operated correctly but READDIR failed with
NFS3ERR_INVAL. The vfs_llseek() call returned EINVAL when it is
passed a non-zero starting directory cookie.

I bisected to commit c689bdd3bf ("nfsd: further centralize
protocol version checks.").

Turns out that nfsd3_proc_readdir() does not call fh_verify() before
it calls nfsd_readdir(), so the new fhp->fh_64bit_cookies boolean is
not set properly. This leaves the NFSD_MAY_64BIT_COOKIE unset when
the directory is opened.

For ext4, this causes the wrong "max file size" value to be used
when sanity checking the incoming directory cookie (which is a seek
offset value).

The fhp->fh_64bit_cookies boolean is /always/ properly initialized
after nfsd_open() returns. There doesn't seem to be a reason for the
generic NFSD open helper to handle the f_mode fix-up for
directories, so just move that to the one caller that tries to open
an S_IFDIR with NFSD_MAY_64BIT_COOKIE.

Suggested-by: NeilBrown <neilb@suse.de>
Fixes: c689bdd3bf ("nfsd: further centralize protocol version checks.")
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-07 09:11:37 -05:00
Linus Torvalds
3e5e6c9900 nfsd-6.12 fixes:
- Fix two async COPY bugs found during NFS bake-a-thon
 - Fix an svcrdma memory leak
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcmT5wACgkQM2qzM29m
 f5coUg/9FQMf1IeXhyGlDtV0ELUxkHoXaZ2T6zhfXFoLmyl/DU4AWMH4YbSAqk2M
 NnR47sVsM08tIwE3KYQgYzLbbFF41nQUa1sckeel+nYGtpcN6IwOlU5LYYpNNFeQ
 vsECxV78BA6FGdjaXwQ07r4G6lpVhCCqM/RZpDrwNSyoIWVLo77KBUVCSoQb5wzG
 z7OBvO9M7HfVJVOHcPd+tVcZaGAF0fhW812fibZQKV2mrWdhOOe+gWVs8ro3tmm1
 GocbTTQW2hlYcLCZPe1przTI9flfwon6Lk8TmIZuU5IrzcaAB+U3P140aKgl9427
 v4WdLuKYlKi+xISBdRG3omyaLroNUs8IHW4KoBXAW3FinyLzNsAyoPxb02m7SEge
 sOJ/gbeLtb2u+ur4wAp4gDmVKfg3TGyh05Hdt96LXsbQUuWIlwEcPurl+nY93Eoq
 vrPLIdPOXrOD5jBIaVQkBYlaCn04mDg+VTNbG9hW1wjorVFpWKS7MwCjVloXSVIn
 uE++cVpQtIKp8aTYBbFVXqtVREatczl++f+Npnlm8xlcquDbaORkk4ZBOs4vQuHo
 pNuZcWO0rIBR6hakr44OjTLnJBIwChPBYvBVgtq1E6oAbwHvC2SVXbiJB8IcCPOx
 nB2jnF0/tpTs2LnrHAxdAGdU9Om6RGmahfz/uwh/8djdbCVH+Ik=
 =NiQh
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix two async COPY bugs found during NFS bake-a-thon

 - Fix an svcrdma memory leak

* tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  rpcrdma: Always release the rpcrdma_device's xa_array
  NFSD: Never decrement pending_async_copies on error
  NFSD: Initialize struct nfsd4_copy earlier
2024-11-02 09:27:11 -10:00
Chuck Lever
8286f8b622 NFSD: Never decrement pending_async_copies on error
The error flow in nfsd4_copy() calls cleanup_async_copy(), which
already decrements nn->pending_async_copies.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-30 14:12:16 -04:00
Chuck Lever
63fab04cbd NFSD: Initialize struct nfsd4_copy earlier
Ensure the refcount and async_copies fields are initialized early.
cleanup_async_copy() will reference these fields if an error occurs
in nfsd4_copy(). If they are not correctly initialized, at the very
least, a refcount underflow occurs.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-29 15:31:18 -04:00
Linus Torvalds
f647053312 nfsd-6.12 fixes:
- Fix a couple of use-after-free bugs
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcbp9YACgkQM2qzM29m
 f5dzAg/4lCFRbsebia7qktW88ZDBbAoZyKolk2DHfjZXCuq5DHb/5I/Hk1rGyTYs
 VaJmCU59ZdpyBFSdQhOYKf2xvgNPvJG02U8il5KWtMAY5cStXFjeU0FSDoC5O4Dl
 9IoaVbtAWGMCjxWJ1WEGpU82JoM7moSVB4G718LlxF+4cUS7idq5se0uK31WQvft
 DmJsOfmnch1Y/7+RRcDwbBu0HwP2ZQHS8zMYMQ2JGXPDZJFenTibezVb36YyzyZn
 WQfvaW6MmdiVL9omZxvURL9WuBKA2L+Ly/92PyHaflcAXSngcpfu28IIQbzp/m7K
 JT/3ad32lB7F3LrDP5I4gwVh8oGLYiI5r7RBWo0e98LPvAR/89gBVdZHhjasstCh
 nAL7Kk6P/jQdbM/KR9T+yS7xTVScI5Wp4Xcitz2mlHgU4br67GO9gpo1e/tqXenm
 Gasapkg5qCduz+ksj2vwpeFXKQi+qwJgfVGKMxELoV8qTazyr09Dfgouqe045ztl
 /0khkOrLkw0bYLDNJKhj/XG0ZEV5V10c/0PEnivC2BHVQioBIDRQAaH2S2G/8vQH
 MdWayGhNlTV0g4DdtMVPxf5uN+uQmfMsj5BIe17NIUYxiJnw0CSM/bzHUcJ15+DH
 nTblaek6sa5BFpZ2fed8by4il4/tme3NOJEeHnSqe/3QKyZgOQ==
 =1YOr
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix a couple of use-after-free bugs

* tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
  nfsd: fix race between laundromat and free_stateid
2024-10-25 11:38:15 -07:00
Yang Erkun
d5ff2fb2e7 nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
In the normal case, when we excute `echo 0 > /proc/fs/nfsd/threads`, the
function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will
release all resources related to the hashed `nfs4_client`. If the
`nfsd_client_shrinker` is running concurrently, the `expire_client`
function will first unhash this client and then destroy it. This can
lead to the following warning. Additionally, numerous use-after-free
errors may occur as well.

nfsd_client_shrinker         echo 0 > /proc/fs/nfsd/threads

expire_client                nfsd_shutdown_net
  unhash_client                ...
                               nfs4_state_shutdown_net
                                 /* won't wait shrinker exit */
  /*                             cancel_work(&nn->nfsd_shrinker_work)
   * nfsd_file for this          /* won't destroy unhashed client1 */
   * client1 still alive         nfs4_state_destroy_net
   */

                               nfsd_file_cache_shutdown
                                 /* trigger warning */
                                 kmem_cache_destroy(nfsd_file_slab)
                                 kmem_cache_destroy(nfsd_file_mark_slab)
  /* release nfsd_file and mark */
  __destroy_client

====================================================================
BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on
__kmem_cache_shutdown()
--------------------------------------------------------------------
CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3+ #1

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xac/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

====================================================================
BUG nfsd_file_mark (Tainted: G    B   W         ): Objects remaining
nfsd_file_mark on __kmem_cache_shutdown()
--------------------------------------------------------------------

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xc8/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

To resolve this issue, cancel `nfsd_shrinker_work` using synchronous
mode in nfs4_state_shutdown_net.

Fixes: 7c24fa2250 ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker")
Signed-off-by: Yang Erkun <yangerkun@huaweicloud.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-21 10:27:36 -04:00
Olga Kornievskaia
8dd91e8d31 nfsd: fix race between laundromat and free_stateid
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.

Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:

kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20

This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).

First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.

Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.

Fixes: 2d4a532d38 ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-18 16:40:37 -04:00
Linus Torvalds
6254d53727 NFS Client Bugfixes for Linux 6.12-rc
Localio Bugfixes:
   * Remove duplicated include in localio.c
   * Fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
   * Fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT
   * Fix nfsd_file tracepoints to handle NULL rqstp pointers
 
 Other Bugfixes:
   * Fix program selection loop in svc_process_common
   * Fix integer overflow in decode_rc_list()
   * Prevent NULL-pointer dereference in nfs42_complete_copies()
   * Fix CB_RECALL performance issues when using a large number of delegations
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmcJjjQACgkQ18tUv7Cl
 QOvgJw/6A33s+pjyBVLIKT6oMCPkUJeQ4Rhg9Je0Qw/ji0eFkT4Eyd65kRz3T9M/
 qRrCfWaUd2dTYcbKQyhuGTlEfICZa9R4I0/Ztk9yvf9xcd1xFXKzTkFekGUVeHQA
 OcngDu9psFxhvyzKI8nAHs1ephX/T7TywvTKANMRbeRCYYvVkytAt9YeVMigYZa5
 dnchoUdGUdL6B6RXCU/Qhf0A1uYyA4hkk/FTBCPgv+kYx5pnjFq0y/yIIHDzCR3I
 +yE1ss3EpVTQgt2Ca/cmDyYXsa7G8G51U7cS5AeIoXfsf1EGtTujowWcBY4oqFEC
 ixx58fQe48AqwsP5XDZn8gnsuYH9snnw5rIB0IVqq55/a+XLMupHayyf/iziMV3s
 JWgT4gKDyFca2pT+bJ8iWweU+ecRYxKGnh2NydyBiqowogsHZm4uKh0vELvqqkBd
 RIjCyIiQVhYBII2jqpjRnxrqhGUT5XO99NQdQIGV0bUjCEP4YAjY4ChfEVcWXhnB
 ppyBP+r8N5O77NcVqsVQS26U0/jb9K30LyYl9VT43ank3d+VVtHA5ZqnUflWtwuc
 2XiGDvXW9mIvbVraWIZXUNVy39bzRclDf5bx4jeYLnKCMym81rkEIBOvBKQKZTrl
 v+1Nhaj+fSw+rFSUm0KPqms0UDiT0Ol7ltu84ifadYqubbSEbqU=
 =QBvR
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Localio Bugfixes:
   - remove duplicated include in localio.c
   - fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
   - fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT
   - fix nfsd_file tracepoints to handle NULL rqstp pointers

  Other Bugfixes:
   - fix program selection loop in svc_process_common
   - fix integer overflow in decode_rc_list()
   - prevent NULL-pointer dereference in nfs42_complete_copies()
   - fix CB_RECALL performance issues when using a large number of
     delegations"

* tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: remove revoked delegation from server's delegation list
  nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp
  nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT
  nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
  NFSv4: Prevent NULL-pointer dereference in nfs42_complete_copies()
  SUNRPC: Fix integer overflow in decode_rc_list()
  sunrpc: fix prog selection loop in svc_process_common
  nfs: Remove duplicated include in localio.c
2024-10-11 15:37:15 -07:00
Linus Torvalds
5870963f6c nfsd-6.12 fixes:
- Fix NFSD bring-up / shutdown
 - Fix a UAF when releasing a stateid
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcHx6IACgkQM2qzM29m
 f5ecExAAheSpSP9D+1n3hKeOlNfhY8FUzd41Arn4NYV+jIBJbtx94/FlSNOxA0mp
 Ovm1I8uAxy4TR8TLt7tsxfT7JDOStKwXFl3QlOUZT/+uyyJr7/q5R959R3oMiccR
 Rfrpj6j2yYWrI8qGDGHca4Vv2bDSxr4mzztWwDe0SHSsjwf4OAcv+XF5vcfZ/CJN
 Bxulb9WfNU8XvdFcRDHokMfk6jiY6/+FCTwX8ckvbVEG6gHT8+CRYSUJ05j0LJGo
 xKZV913NgzcuV7PH0vq6vExJE6+rEPt/ejDAT5FM5yeNe+WJ4RTDgsYyIr9iLbHF
 mWB9M4NnP+EZhejtOCbZ9RZjjKro09ilEPpqILuuGQPtcHSeWmhNbFz0kwLe+zYZ
 CdtjnPZhjB0ITWgZ1HCtoJ8k/ZcMa7iiM/kApMLGr9fVj8/BHHFzS95PK7K/Fqur
 FLdhvo6CzZCnRd16e2kqWsG7wO2lPWcz4NWTf9wxIG5GCunXoVCEnK1VfHvnldbH
 BIFXZ+ib5qnL2i3Qmz7bQxmfIp5ryZnNx1mF0OM8imR9K/rsnARd7JfQ99lpMy8D
 mD4coZVTMMk/Zg9zuH8k5GBzB2zXXqgngp4IJIxqrKR7/AsuSU3R7r+O9CWN91GQ
 GKpRtMn/rVUg81jxDr3qoKquyxONoyVrVXAKsj1PgUSQdjUJgqU=
 =Rud7
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix NFSD bring-up / shutdown

 - Fix a UAF when releasing a stateid

* tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix possible badness in FREE_STATEID
  nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed
  NFSD: Mark filecache "down" if init fails
2024-10-10 09:52:49 -07:00
Olga Kornievskaia
c88c150a46 nfsd: fix possible badness in FREE_STATEID
When multiple FREE_STATEIDs are sent for the same delegation stateid,
it can lead to a possible either use-after-free or counter refcount
underflow errors.

In nfsd4_free_stateid() under the client lock we find a delegation
stateid, however the code drops the lock before calling nfs4_put_stid(),
that allows another FREE_STATE to find the stateid again. The first one
will proceed to then free the stateid which leads to either
use-after-free or decrementing already zeroed counter.

Fixes: 3f29cc82a8 ("nfsd: split sc_status out of sc_type")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-05 15:44:25 -04:00
Mike Snitzer
76f5af9952 nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp
Otherwise nfsd_file_acquire, nfsd_file_insert_err, and
nfsd_file_cons_err will hit a NULL pointer when they are enabled and
LOCALIO used.

Example trace output (note xid is 0x0 and LOCALIO flag set):
 nfsd_file_acquire: xid=0x0 inode=0000000069a1b2e7
 may_flags=WRITE|LOCALIO ref=1 nf_flags=HASHED|GC nf_may=WRITE
 nf_file=0000000070123234 status=0

Fixes: c63f0e48fe ("nfsd: add nfsd_file_acquire_local()")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-04 14:52:04 -04:00
Mike Snitzer
65f2a5c366 nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
Add nfs_to_nfsd_file_put_local() interface to fix race with nfsd
module unload.  Similarly, use RCU around nfs_open_local_fh()'s error
path call to nfs_to->nfsd_serv_put().  Holding RCU ensures that NFS
will safely _call and return_ from its nfs_to calls into the NFSD
functions nfsd_file_put_local() and nfsd_serv_put().

Otherwise, if RCU isn't used then there is a narrow window when NFS's
reference for the nfsd_file and nfsd_serv are dropped and the NFSD
module could be unloaded, which could result in a crash from the
return instruction for either nfs_to->nfsd_file_put_local() or
nfs_to->nfsd_serv_put().

Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-03 16:19:43 -04:00
Lizhi Xu
cad3f4a22c inotify: Fix possible deadlock in fsnotify_destroy_mark
[Syzbot reported]
WARNING: possible circular locking dependency detected
6.11.0-rc4-syzkaller-00019-gb311c1b497e5 #0 Not tainted
------------------------------------------------------
kswapd0/78 is trying to acquire lock:
ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline]
ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578

but task is already holding lock:
ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6841 [inline]
ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xbb4/0x35a0 mm/vmscan.c:7223

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}-{0:0}:
       ...
       kmem_cache_alloc_noprof+0x3d/0x2a0 mm/slub.c:4044
       inotify_new_watch fs/notify/inotify/inotify_user.c:599 [inline]
       inotify_update_watch fs/notify/inotify/inotify_user.c:647 [inline]
       __do_sys_inotify_add_watch fs/notify/inotify/inotify_user.c:786 [inline]
       __se_sys_inotify_add_watch+0x72e/0x1070 fs/notify/inotify/inotify_user.c:729
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (&group->mark_mutex){+.+.}-{3:3}:
       ...
       __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
       fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline]
       fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578
       fsnotify_destroy_marks+0x14a/0x660 fs/notify/mark.c:934
       fsnotify_inoderemove include/linux/fsnotify.h:264 [inline]
       dentry_unlink_inode+0x2e0/0x430 fs/dcache.c:403
       __dentry_kill+0x20d/0x630 fs/dcache.c:610
       shrink_kill+0xa9/0x2c0 fs/dcache.c:1055
       shrink_dentry_list+0x2c0/0x5b0 fs/dcache.c:1082
       prune_dcache_sb+0x10f/0x180 fs/dcache.c:1163
       super_cache_scan+0x34f/0x4b0 fs/super.c:221
       do_shrink_slab+0x701/0x1160 mm/shrinker.c:435
       shrink_slab+0x1093/0x14d0 mm/shrinker.c:662
       shrink_one+0x43b/0x850 mm/vmscan.c:4815
       shrink_many mm/vmscan.c:4876 [inline]
       lru_gen_shrink_node mm/vmscan.c:4954 [inline]
       shrink_node+0x3799/0x3de0 mm/vmscan.c:5934
       kswapd_shrink_node mm/vmscan.c:6762 [inline]
       balance_pgdat mm/vmscan.c:6954 [inline]
       kswapd+0x1bcd/0x35a0 mm/vmscan.c:7223

[Analysis]
The problem is that inotify_new_watch() is using GFP_KERNEL to allocate
new watches under group->mark_mutex, however if dentry reclaim races
with unlinking of an inode, it can end up dropping the last dentry reference
for an unlinked inode resulting in removal of fsnotify mark from reclaim
context which wants to acquire group->mark_mutex as well.

This scenario shows that all notification groups are in principle prone
to this kind of a deadlock (previously, we considered only fanotify and
dnotify to be problematic for other reasons) so make sure all
allocations under group->mark_mutex happen with GFP_NOFS.

Reported-and-tested-by: syzbot+c679f13773f295d2da53@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c679f13773f295d2da53
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240927143642.2369508-1-lizhi.xu@windriver.com
2024-10-02 15:14:29 +02:00
Christian Brauner
09ee2a670d
Merge patch series "Fixup NLM and kNFSD file lock callbacks"
Benjamin Coddington <bcodding@redhat.com> says:

Last year both GFS2 and OCFS2 had some work done to make their locking more
robust when exported over NFS.  Unfortunately, part of that work caused both
NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send
lock notifications to clients.

This in itself is not a huge problem because most NFS clients will still
poll the server in order to acquire a conflicted lock, but now that I've
noticed it I can't help but try to fix it because there are big advantages
for setups that might depend on timely lock notifications, and we've
supported that as a feature for a long time.

Its important for NLM and kNFSD that they do not block their kernel threads
inside filesystem's file_lock implementations because that can produce
deadlocks.  We used to make sure of this by only trusting that
posix_lock_file() can correctly handle blocking lock calls asynchronously,
so the lock managers would only setup their file_lock requests for async
callbacks if the filesystem did not define its own lock() file operation.

However, when GFS2 and OCFS2 grew the capability to correctly
handle blocking lock requests asynchronously, they started signalling this
behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting
posix_lock_file() was inadvertently dropped, so now most filesystems no
longer produce lock notifications when exported over NFS.

I tried to fix this by simply including the old check for lock(), but the
resulting include mess and layering violations was more than I could accept.
There's a much cleaner way presented here using an fop_flag, which while
potentially flag-greedy, greatly simplifies the problem and grooms the
way for future uses by both filesystems and lock managers alike.

* patches from https://lore.kernel.org/r/cover.1726083391.git.bcodding@redhat.com:
  exportfs: Remove EXPORT_OP_ASYNC_LOCK
  NLM/NFSD: Fix lock notifications for async-capable filesystems
  gfs2/ocfs2: set FOP_ASYNC_LOCK
  fs: Introduce FOP_ASYNC_LOCK
  NFS: trace: show TIMEDOUT instead of 0x6e
  nfsd: use system_unbound_wq for nfsd_file_gc_worker()
  nfsd: count nfsd_file allocations
  nfsd: fix refcount leak when file is unhashed after being found
  nfsd: remove unneeded EEXIST error check in nfsd_do_file_acquire
  nfsd: add list_head nf_gc to struct nfsd_file

Link: https://lore.kernel.org/r/cover.1726083391.git.bcodding@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-02 07:52:07 +02:00
Benjamin Coddington
7e64c5bc49
NLM/NFSD: Fix lock notifications for async-capable filesystems
Instead of checking just the exportfs flag, use the new
locks_can_async_lock() helper which allows NLM and NFSD to once again
support lock notifications for all filesystems which use posix_lock_file().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Link: https://lore.kernel.org/r/865c40da44af67939e8eb560d17a26c9c50f23e0.1726083391.git.bcodding@redhat.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-01 17:00:18 +02:00
Mike Snitzer
946af9b3a0 nfsd: implement server support for NFS_LOCALIO_PROGRAM
The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
RPC method that allows the Linux NFS client to verify the local Linux
NFS server can see the nonce (single-use UUID) the client generated and
made available in nfs_common.  The server expects this protocol to use
the same transport as NFS and NFSACL for its RPCs.  This protocol
isn't part of an IETF standard, nor does it need to be considering it
is Linux-to-Linux auxiliary RPC protocol that amounts to an
implementation detail.

The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
the fixed UUID_SIZE (16 bytes).  The fixed size opaque encode and decode
XDR methods are used instead of the less efficient variable sized
methods.

The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):
Linux Kernel Organization       400122  nfslocalio

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
[neilb: factored out and simplified single localio protocol]
Co-developed-by: NeilBrown <neilb@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
Weston Andros Adamson
fa4983862e nfsd: add LOCALIO support
Add server support for bypassing NFS for localhost reads, writes, and
commits. This is only useful when both the client and server are
running on the same host.

If nfsd_open_local_fh() fails then the NFS client will both retry and
fallback to normal network-based read, write and commit operations if
localio is no longer supported.

Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc) regardless of whether localio or regular NFS
access is used.  The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for localio.  Store
auth_domain for localio in nfsd_uuid_t and transfer it to the client
if it is local to the server.

Relative to containers, localio gives the client access to the network
namespace the server has.  This is required to allow the client to
access the server's per-namespace nfsd_net struct.

This commit also introduces the use of NFSD's percpu_ref to interlock
nfsd_destroy_serv and nfsd_open_local_fh, to ensure nn->nfsd_serv is
not destroyed while in use by nfsd_open_local_fh and other LOCALIO
client code.

CONFIG_NFS_LOCALIO enables NFS server support for LOCALIO.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Co-developed-by: NeilBrown <neilb@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00