Commit graph

53 commits

Author SHA1 Message Date
Linus Torvalds
672dcda246 vfs-6.17-rc1.pidfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCiQAKCRCRxhvAZXjc
 orltAQDq3y1anYETz5/FD6P2gXY1W5hXdSm3EHHeacQ1JjTXvgEA2g1lWO7J4anf
 oOVE8aSvMow/FOjivLZBYmI65pkYJAE=
 =oDKB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pidfs updates from Christian Brauner:

 - persistent info

   Persist exit and coredump information independent of whether anyone
   currently holds a pidfd for the struct pid.

   The current scheme allocated pidfs dentries on-demand repeatedly.
   This scheme is reaching it's limits as it makes it impossible to pin
   information that needs to be available after the task has exited or
   coredumped and that should not be lost simply because the pidfd got
   closed temporarily. The next opener should still see the stashed
   information.

   This is also a prerequisite for supporting extended attributes on
   pidfds to allow attaching meta information to them.

   If someone opens a pidfd for a struct pid a pidfs dentry is allocated
   and stashed in pid->stashed. Once the last pidfd for the struct pid
   is closed the pidfs dentry is released and removed from pid->stashed.

   So if 10 callers create a pidfs dentry for the same struct pid
   sequentially, i.e., each closing the pidfd before the other creates a
   new one then a new pidfs dentry is allocated every time.

   Because multiple tasks acquiring and releasing a pidfd for the same
   struct pid can race with each another a task may still find a valid
   pidfs entry from the previous task in pid->stashed and reuse it. Or
   it might find a dead dentry in there and fail to reuse it and so
   stashes a new pidfs dentry. Multiple tasks may race to stash a new
   pidfs dentry but only one will succeed, the other ones will put their
   dentry.

   The current scheme aims to ensure that a pidfs dentry for a struct
   pid can only be created if the task is still alive or if a pidfs
   dentry already existed before the task was reaped and so exit
   information has been was stashed in the pidfs inode.

   That's great except that it's buggy. If a pidfs dentry is stashed in
   pid->stashed after pidfs_exit() but before __unhash_process() is
   called we will return a pidfd for a reaped task without exit
   information being available.

   The pidfds_pid_valid() check does not guard against this race as it
   doens't sync at all with pidfs_exit(). The pid_has_task() check might
   be successful simply because we're before __unhash_process() but
   after pidfs_exit().

   Introduce a new scheme where the lifetime of information associated
   with a pidfs entry (coredump and exit information) isn't bound to the
   lifetime of the pidfs inode but the struct pid itself.

   The first time a pidfs dentry is allocated for a struct pid a struct
   pidfs_attr will be allocated which will be used to store exit and
   coredump information.

   If all pidfs for the pidfs dentry are closed the dentry and inode can
   be cleaned up but the struct pidfs_attr will stick until the struct
   pid itself is freed. This will ensure minimal memory usage while
   persisting relevant information.

   The new scheme has various advantages. First, it allows to close the
   race where we end up handing out a pidfd for a reaped task for which
   no exit information is available. Second, it minimizes memory usage.
   Third, it allows to remove complex lifetime tracking via dentries
   when registering a struct pid with pidfs. There's no need to get or
   put a reference. Instead, the lifetime of exit and coredump
   information associated with a struct pid is bound to the lifetime of
   struct pid itself.

 - extended attributes

   Now that we have a way to persist information for pidfs dentries we
   can start supporting extended attributes on pidfds. This will allow
   userspace to attach meta information to tasks.

   One natural extension would be to introduce a custom pidfs.* extended
   attribute space and allow for the inheritance of extended attributes
   across fork() and exec().

   The first simple scheme will allow privileged userspace to set
   trusted extended attributes on pidfs inodes.

 - Allow autonomous pidfs file handles

   Various filesystems such as pidfs and drm support opening file
   handles without having to require a file descriptor to identify the
   filesystem. The filesystem are global single instances and can be
   trivially identified solely on the information encoded in the file
   handle.

   This makes it possible to not have to keep or acquire a sentinal file
   descriptor just to pass it to open_by_handle_at() to identify the
   filesystem. That's especially useful when such sentinel file
   descriptor cannot or should not be acquired.

   For pidfs this means a file handle can function as full replacement
   for storing a pid in a file. Instead a file handle can be stored and
   reopened purely based on the file handle.

   Such autonomous file handles can be opened with or without specifying
   a a file descriptor. If no proper file descriptor is used the
   FD_PIDFS_ROOT sentinel must be passed. This allows us to define
   further special negative fd sentinels in the future.

   Userspace can trivially test for support by trying to open the file
   handle with an invalid file descriptor.

 - Allow pidfds for reaped tasks with SCM_PIDFD messages

   This is a logical continuation of the earlier work to create pidfds
   for reaped tasks through the SO_PEERPIDFD socket option merged in
   923ea4d448 ("Merge patch series "net, pidfs: enable handing out
   pidfds for reaped sk->sk_peer_pid"").

 - Two minor fixes:

    * Fold fs_struct->{lock,seq} into a seqlock

    * Don't bother with path_{get,put}() in unix_open_file()

* tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
  don't bother with path_get()/path_put() in unix_open_file()
  fold fs_struct->{lock,seq} into a seqlock
  selftests: net: extend SCM_PIDFD test to cover stale pidfds
  af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD
  af_unix: stash pidfs dentry when needed
  af_unix/scm: fix whitespace errors
  af_unix: introduce and use scm_replace_pid() helper
  af_unix: introduce unix_skb_to_scm helper
  af_unix: rework unix_maybe_add_creds() to allow sleep
  selftests/pidfd: decode pidfd file handles withou having to specify an fd
  fhandle, pidfs: support open_by_handle_at() purely based on file handle
  uapi/fcntl: add FD_PIDFS_ROOT
  uapi/fcntl: add FD_INVALID
  fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
  uapi/fcntl: mark range as reserved
  fhandle: reflow get_path_anchor()
  pidfs: add pidfs_root_path() helper
  fhandle: rename to get_path_anchor()
  fhandle: hoist copy_from_user() above get_path_from_fd()
  fhandle: raise FILEID_IS_DIR in handle_type
  ...
2025-07-28 14:10:15 -07:00
Laura Brehm
830a9e37cf
coredump: fix PIDFD_INFO_COREDUMP ioctl check
In Commit 1d8db6fd69 ("pidfs,
coredump: add PIDFD_INFO_COREDUMP"), the following code was added:

    if (mask & PIDFD_INFO_COREDUMP) {
        kinfo.mask |= PIDFD_INFO_COREDUMP;
        kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
    }
    [...]
    if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) {
        task_lock(task);
        if (task->mm)
            kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
        task_unlock(task);
    }

The second bit in particular looks off to me - the condition in essence
checks whether PIDFD_INFO_COREDUMP was **not** requested, and if so
fetches the coredump_mask in kinfo, since it's checking !(kinfo.mask &
PIDFD_INFO_COREDUMP), which is unconditionally set in the earlier hunk.

I'm tempted to assume the idea in the second hunk was to calculate the
coredump mask if one was requested but fetched in the first hunk, in
which case the check should be
    if ((kinfo.mask & PIDFD_INFO_COREDUMP) && !(kinfo.coredump_mask))
which might be more legibly written as
    if ((mask & PIDFD_INFO_COREDUMP) && !(kinfo.coredump_mask))

This could also instead be achieved by changing the first hunk to be:

    if (mask & PIDFD_INFO_COREDUMP) {
	kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
	if (kinfo.coredump_mask)
	    kinfo.mask |= PIDFD_INFO_COREDUMP;
    }

and the second hunk to:

    if ((mask & PIDFD_INFO_COREDUMP) && !(kinfo.mask & PIDFD_INFO_COREDUMP)) {
	task_lock(task);
        if (task->mm) {
	    kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
            kinfo.mask |= PIDFD_INFO_COREDUMP;
        }
        task_unlock(task);
    }

However, when looking at this, the supposition that the second hunk
means to cover cases where the coredump info was requested but the first
hunk failed to get it starts getting doubtful, so apologies if I'm
completely off-base.

This patch addresses the issue by fixing the check in the second hunk.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
Link: https://lore.kernel.org/20250703120244.96908-3-laurabrehm@hey.com
Cc: brauner@kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07 13:26:03 +02:00
Christian Brauner
a0d8051cfd
pidfs: add pidfs_root_path() helper
Allow to return the root of the global pidfs filesystem.

Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-4-d02a04858fe3@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 13:00:09 +02:00
Christian Brauner
f077638b5f
pidfs: fix pidfs_free_pid()
Ensure that we handle the case where task creation fails and pid->attr
was never accessed at all.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23 14:51:40 +02:00
Christian Brauner
f9fac1f48c
pidfs: add some CONFIG_DEBUG_VFS asserts
Allow to catch some obvious bugs.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-16-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23 12:22:08 +02:00
Christian Brauner
91d837cae3
pidfs: support xattrs on pidfds
Now that we have a way to persist information for pidfs dentries we can
start supporting extended attributes on pidfds. This will allow
userspace to attach meta information to tasks.

One natural extension would be to introduce a custom pidfs.* extended
attribute space and allow for the inheritance of extended attributes
across fork() and exec().

The first simple scheme will allow privileged userspace to set trusted
extended attributes on pidfs inodes.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-12-98f3456fd552@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23 12:21:37 +02:00
Christian Brauner
f769b3db24 pidfs: make inodes mutable
Prepare for allowing extended attributes to be set on pidfd inodes by
allowing them to be mutable.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-11-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 14:28:25 +02:00
Christian Brauner
d718249bba pidfs: remove pidfs_pid_valid()
The validation is now completely handled in path_from_stashed().

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-9-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 14:28:25 +02:00
Christian Brauner
804d679449 pidfs: remove pidfs_{get,put}_pid()
Now that we stash persistent information in struct pid there's no need
to play volatile games with pinning struct pid via dentries in pidfs.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-8-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 14:28:24 +02:00
Christian Brauner
0f93d71b9d pidfs: remove custom inode allocation
We don't need it anymore as persistent information is allocated lazily
and stashed in struct pid.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-7-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 14:28:24 +02:00
Christian Brauner
5ee83f8d1a pidfs: remove unused members from struct pidfs_inode
We've moved persistent information to struct pid.
So there's no need for these anymore.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-6-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 14:28:24 +02:00
Christian Brauner
8ec7c826d9 pidfs: persist information
Persist exit and coredump information independent of whether anyone
currently holds a pidfd for the struct pid.

The current scheme allocated pidfs dentries on-demand repeatedly.
This scheme is reaching it's limits as it makes it impossible to pin
information that needs to be available after the task has exited or
coredumped and that should not be lost simply because the pidfd got
closed temporarily. The next opener should still see the stashed
information.

This is also a prerequisite for supporting extended attributes on
pidfds to allow attaching meta information to them.

If someone opens a pidfd for a struct pid a pidfs dentry is allocated
and stashed in pid->stashed. Once the last pidfd for the struct pid is
closed the pidfs dentry is released and removed from pid->stashed.

So if 10 callers create a pidfs dentry for the same struct pid
sequentially, i.e., each closing the pidfd before the other creates a
new one then a new pidfs dentry is allocated every time.

Because multiple tasks acquiring and releasing a pidfd for the same
struct pid can race with each another a task may still find a valid
pidfs entry from the previous task in pid->stashed and reuse it. Or it
might find a dead dentry in there and fail to reuse it and so stashes a
new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry
but only one will succeed, the other ones will put their dentry.

The current scheme aims to ensure that a pidfs dentry for a struct pid
can only be created if the task is still alive or if a pidfs dentry
already existed before the task was reaped and so exit information has
been was stashed in the pidfs inode.

That's great except that it's buggy. If a pidfs dentry is stashed in
pid->stashed after pidfs_exit() but before __unhash_process() is called
we will return a pidfd for a reaped task without exit information being
available.

The pidfds_pid_valid() check does not guard against this race as it
doens't sync at all with pidfs_exit(). The pid_has_task() check might be
successful simply because we're before __unhash_process() but after
pidfs_exit().

Introduce a new scheme where the lifetime of information associated with
a pidfs entry (coredump and exit information) isn't bound to the
lifetime of the pidfs inode but the struct pid itself.

The first time a pidfs dentry is allocated for a struct pid a struct
pidfs_attr will be allocated which will be used to store exit and
coredump information.

If all pidfs for the pidfs dentry are closed the dentry and inode can be
cleaned up but the struct pidfs_attr will stick until the struct pid
itself is freed. This will ensure minimal memory usage while persisting
relevant information.

The new scheme has various advantages. First, it allows to close the
race where we end up handing out a pidfd for a reaped task for which no
exit information is available. Second, it minimizes memory usage.
Third, it allows to remove complex lifetime tracking via dentries when
registering a struct pid with pidfs. There's no need to get or put a
reference. Instead, the lifetime of exit and coredump information
associated with a struct pid is bound to the lifetime of struct pid
itself.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-5-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 14:28:24 +02:00
Christian Brauner
1a1ad73aa1 pidfs: raise SB_I_NODEV and SB_I_NOEXEC
Similar to commit 1ed95281c0 ("anon_inode: raise SB_I_NODEV and SB_I_NOEXEC"):
it shouldn't be possible to execute pidfds via
execveat(fd_anon_inode, "", NULL, NULL, AT_EMPTY_PATH)
so raise SB_I_NOEXEC so that no one gets any creative ideas.

Also raise SB_I_NODEV as we don't expect or support any devices on pidfs.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-1-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 14:28:23 +02:00
Mike Yuan
b55eb6eb2a
pidfs: never refuse ppid == 0 in PIDFD_GET_INFO
In systemd we spotted an issue after switching to ioctl(PIDFD_GET_INFO)
for obtaining pid number the pidfd refers to, that for processes
with a parent from outer pidns PIDFD_GET_INFO unexpectedly yields
-ESRCH [1]. It turned out that there's an arbitrary check blocking
this, which is not really sensible given getppid() happily returns
0 for such processes. Just drop the spurious check and userspace
ought to handle ppid == 0 properly everywhere.

[1] https://github.com/systemd/systemd/issues/37715

Fixes: cdda1f26e7 ("pidfd: add ioctl to retrieve pid info")
Signed-off-by: Mike Yuan <me@yhndnzj.com>
Link: https://lore.kernel.org/20250604150238.42664-1-me@yhndnzj.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: Luca Boccassi <luca.boccassi@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-04 22:48:32 +02:00
Linus Torvalds
c5bfc48d54 vfs-6.16-rc1.coredump
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
 oliqAQCVdrBn7D2+dB04hjefFq6W6LhyLGrtCCliflicN5SyxAD+PHHiB9nFKe6J
 xQkaNArCJjPd2QEx73aGjHzi3UQq6Qs=
 =Pk9c
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull coredump updates from Christian Brauner:
 "This adds support for sending coredumps over an AF_UNIX socket. It
  also makes (implicit) use of the new SO_PEERPIDFD ability to hand out
  pidfds for reaped peer tasks

  The new coredump socket will allow userspace to not have to rely on
  usermode helpers for processing coredumps and provides a saf way to
  handle them instead of relying on super privileged coredumping helpers

  This will also be significantly more lightweight since the kernel
  doens't have to do a fork()+exec() for each crashing process to spawn
  a usermodehelper. Instead the kernel just connects to the AF_UNIX
  socket and userspace can process it concurrently however it sees fit.
  Support for userspace is incoming starting with systemd-coredump

  There's more work coming in that direction next cycle. The rest below
  goes into some details and background

  Coredumping currently supports two modes:

   (1) Dumping directly into a file somewhere on the filesystem.

   (2) Dumping into a pipe connected to a usermode helper process
       spawned as a child of the system_unbound_wq or kthreadd

  For simplicity I'm mostly ignoring (1). There's probably still some
  users of (1) out there but processing coredumps in this way can be
  considered adventurous especially in the face of set*id binaries

  The most common option should be (2) by now. It works by allowing
  userspace to put a string into /proc/sys/kernel/core_pattern like:

          |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

  The "|" at the beginning indicates to the kernel that a pipe must be
  used. The path following the pipe indicator is a path to a binary that
  will be spawned as a usermode helper process. Any additional
  parameters pass information about the task that is generating the
  coredump to the binary that processes the coredump

  In the example the core_pattern shown causes the kernel to spawn
  systemd-coredump as a usermode helper. There's various conceptual
  consequences of this (non-exhaustive list):

   - systemd-coredump is spawned with file descriptor number 0 (stdin)
     connected to the read-end of the pipe. All other file descriptors
     are closed. That specifically includes 1 (stdout) and 2 (stderr).

     This has already caused bugs because userspace assumed that this
     cannot happen (Whether or not this is a sane assumption is
     irrelevant)

   - systemd-coredump will be spawned as a child of system_unbound_wq.
     So it is not a child of any userspace process and specifically not
     a child of PID 1. It cannot be waited upon and is in a weird hybrid
     upcall which are difficult for userspace to control correctly

   - systemd-coredump is spawned with full kernel privileges. This
     necessitates all kinds of weird privilege dropping excercises in
     userspace to make this safe

   - A new usermode helper has to be spawned for each crashing process

  This adds a new mode:

   (3) Dumping into an AF_UNIX socket

  Userspace can set /proc/sys/kernel/core_pattern to:

          @/path/to/coredump.socket

  The "@" at the beginning indicates to the kernel that an AF_UNIX
  coredump socket will be used to process coredumps

  The coredump socket must be located in the initial mount namespace.
  When a task coredumps it opens a client socket in the initial network
  namespace and connects to the coredump socket:

   - The coredump server uses SO_PEERPIDFD to get a stable handle on the
     connected crashing task. The retrieved pidfd will provide a stable
     reference even if the crashing task gets SIGKILLed while generating
     the coredump. That is a huge attack vector right now

   - By setting core_pipe_limit non-zero userspace can guarantee that
     the crashing task cannot be reaped behind it's back and thus
     process all necessary information in /proc/<pid>. The SO_PEERPIDFD
     can be used to detect whether /proc/<pid> still refers to the same
     process

     The core_pipe_limit isn't used to rate-limit connections to the
     socket. This can simply be done via AF_UNIX socket directly

   - The pidfd for the crashing task will contain information how the
     task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
     PIDFD_INFO_COREDUMP which can be used to retreive the coredump
     information

     If the coredump gets a new coredump client connection the kernel
     guarantees that PIDFD_INFO_COREDUMP information is available.

     Currently the following information is provided in the new
     @coredump_mask extension to struct pidfd_info:

      * PIDFD_COREDUMPED is raised if the task did actually coredump

      * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping
        (e.g., undumpable)

      * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
        doesn't need special care by the coredump server

      * PIDFD_COREDUMP_ROOT is raised if the generated coredump should
        be treated as sensitive and the coredump server should restrict
        access to the generated coredump to sufficiently privileged
        users"

* tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  mips, net: ensure that SOCK_COREDUMP is defined
  selftests/coredump: add tests for AF_UNIX coredumps
  selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
  coredump: validate socket name as it is written
  coredump: show supported coredump modes
  pidfs, coredump: add PIDFD_INFO_COREDUMP
  coredump: add coredump socket
  coredump: reflow dump helpers a little
  coredump: massage do_coredump()
  coredump: massage format_corename()
2025-05-26 11:17:01 -07:00
Linus Torvalds
7d7a103d29 vfs-6.16-rc1.pidfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
 ov4zAP4yfqKBAz6eMt9CzDgHCdVQJ9Nuur1EiRdot3maPzHTcQEA2hVkJrvVo1Y/
 jCVAf7gmGX1Uu6nCUF6Vjy35g8i20gA=
 =nzYS
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pidfs updates from Christian Brauner:
 "Features:

   - Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
     socket option

     SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
     the process that called connect() or listen(). This is heavily used
     to safely authenticate clients in userspace avoiding security bugs
     due to pid recycling races (dbus, polkit, systemd, etc.)

     SO_PEERPIDFD currently doesn't support handing out pidfds if the
     sk->sk_peer_pid thread-group leader has already been reaped. In
     this case it currently returns EINVAL. Userspace still wants to get
     a pidfd for a reaped process to have a stable handle it can pass
     on. This is especially useful now that it is possible to retrieve
     exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
     PIDFD_INFO_EXIT flag

     Another summary has been provided by David Rheinsberg:

      > A pidfd can outlive the task it refers to, and thus user-space
      > must already be prepared that the task underlying a pidfd is
      > gone at the time they get their hands on the pidfd. For
      > instance, resolving the pidfd to a PID via the fdinfo must be
      > prepared to read `-1`.
      >
      > Despite user-space knowing that a pidfd might be stale, several
      > kernel APIs currently add another layer that checks for this. In
      > particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
      > already reaped, but returns a stale pidfd if the task is reaped
      > immediately after the respective alive-check.
      >
      > This has the unfortunate effect that user-space now has two ways
      > to check for the exact same scenario: A syscall might return
      > EINVAL/ESRCH/... *or* the pidfd might be stale, even though
      > there is no particular reason to distinguish both cases. This
      > also propagates through user-space APIs, which pass on pidfds.
      > They must be prepared to pass on `-1` *or* the pidfd, because
      > there is no guaranteed way to get a stale pidfd from the kernel.
      >
      > Userspace must already deal with a pidfd referring to a reaped
      > task as the task may exit and get reaped at any time will there
      > are still many pidfds referring to it

     In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
     ensure that PIDFD_INFO_EXIT information is available whenever a
     pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
     promises that reaped pidfds are only handed out if it is guaranteed
     that the caller sees the exit information:

     TEST_F(pidfd_info, success_reaped)
     {
             struct pidfd_info info = {
                     .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
             };

             /*
              * Process has already been reaped and PIDFD_INFO_EXIT been set.
              * Verify that we can retrieve the exit status of the process.
              */
             ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
             ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
             ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
             ASSERT_TRUE(WIFEXITED(info.exit_code));
             ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
     }

     To hand out pidfds for reaped processes we thus allocate a pidfs
     entry for the relevant sk->sk_peer_pid at the time the
     sk->sk_peer_pid is stashed and drop it when the socket is
     destroyed. This guarantees that exit information will always be
     recorded for the sk->sk_peer_pid task and we can hand out pidfds
     for reaped processes

   - Hand a pidfd to the coredump usermode helper process

     Give userspace a way to instruct the kernel to install a pidfd for
     the crashing process into the process started as a usermode helper.
     There's still tricky race-windows that cannot be easily or
     sometimes not closed at all by userspace. There's various ways like
     looking at the start time of a process to make sure that the
     usermode helper process is started after the crashing process but
     it's all very very brittle and fraught with peril

     The crashed-but-not-reaped process can be killed by userspace
     before coredump processing programs like systemd-coredump have had
     time to manually open a PIDFD from the PID the kernel provides
     them, which means they can be tricked into reading from an
     arbitrary process, and they run with full privileges as they are
     usermode helper processes

     Even if that specific race-window wouldn't exist it's still the
     safest and cleanest way to let the kernel provide the pidfd
     directly instead of requiring userspace to do it manually. In
     parallel with this commit we already have systemd adding support
     for this in [1]

     When the usermode helper process is forked we install a pidfd file
     descriptor three into the usermode helper's file descriptor table
     so it's available to the exec'd program

     Since usermode helpers are either children of the system_unbound_wq
     workqueue or kthreadd we know that the file descriptor table is
     empty and can thus always use three as the file descriptor number

     Note, that we'll install a pidfd for the thread-group leader even
     if a subthread is calling do_coredump(). We know that task linkage
     hasn't been removed yet and even if this @current isn't the actual
     thread-group leader we know that the thread-group leader cannot be
     reaped until
     @current has exited

   - Allow telling when a task has not been found from finding the wrong
     task when creating a pidfd

     We currently report EINVAL whenever a struct pid has no tasked
     attached anymore thereby conflating two concepts:

      (1) The task has already been reaped

      (2) The caller requested a pidfd for a thread-group leader but the
          pid actually references a struct pid that isn't used as a
          thread-group leader

     This is causing issues for non-threaded workloads as in where they
     expect ESRCH to be reported, not EINVAL

     So allow userspace to reliably distinguish between (1) and (2)

   - Make it possible to detect when a pidfs entry would outlive the
     struct pid it pinned

   - Add a range of new selftests

  Cleanups:

   - Remove unneeded NULL check from pidfd_prepare() for passed struct
     pid

   - Avoid pointless reference count bump during release_task()

  Fixes:

   - Various fixes to the pidfd and coredump selftests

   - Fix error handling for replace_fd() when spawning coredump usermode
     helper"

* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  pidfs: detect refcount bugs
  coredump: hand a pidfd to the usermode coredump helper
  coredump: fix error handling for replace_fd()
  pidfs: move O_RDWR into pidfs_alloc_file()
  selftests: coredump: Raise timeout to 2 minutes
  selftests: coredump: Fix test failure for slow machines
  selftests: coredump: Properly initialize pointer
  net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
  pidfs: get rid of __pidfd_prepare()
  net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
  pidfs: register pid in pidfs
  net, pidfd: report EINVAL for ESRCH
  release_task: kill the no longer needed get/put_pid(thread_pid)
  pidfs: ensure consistent ENOENT/ESRCH reporting
  exit: move wake_up_all() pidfd waiters into __unhash_process()
  selftest/pidfd: add test for thread-group leader pidfd open for thread
  pidfd: improve uapi when task isn't found
  pidfd: remove unneeded NULL check from pidfd_prepare()
  selftests/pidfd: adapt to recent changes
2025-05-26 10:30:02 -07:00
Christian Brauner
1d8db6fd69
pidfs, coredump: add PIDFD_INFO_COREDUMP
Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP
mask flag. This adds the @coredump_mask field to struct pidfd_info.

When a task coredumps the kernel will provide the following information
to userspace in @coredump_mask:

* PIDFD_COREDUMPED is raised if the task did actually coredump.
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
  undumpable).
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
  doesn't need special care by the coredump server.
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
  treated as sensitive and the coredump server should restrict to the
  generated coredump to sufficiently privileged users.

The kernel guarantees that by the time the connection is made the all
PIDFD_INFO_COREDUMP info is available.

Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-5-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21 13:59:12 +02:00
Christian Brauner
c57f07b235
pidfs: move O_RDWR into pidfs_alloc_file()
Since all pidfds must be O_RDWR currently enfore that directly in the
file allocation function itself instead of letting callers specify it.

Link: https://lore.kernel.org/20250414-work-coredump-v2-1-685bf231f828@kernel.org
Tested-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-02 14:28:45 +02:00
Christian Brauner
a71f402acd
pidfs: get rid of __pidfd_prepare()
Fold it into pidfd_prepare() and rename PIDFD_CLONE to PIDFD_STALE to
indicate that the passed pid might not have task linkage and no explicit
check for that should be performed.

Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-3-450a19461e75@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: David Rheinsberg <david@readahead.eu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-26 08:28:03 +02:00
Christian Brauner
477058411c
pidfs: register pid in pidfs
Add simple helpers that allow a struct pid to be pinned via a pidfs
dentry/inode. If no pidfs dentry exists a new one will be allocated for
it. A reference is taken by pidfs on @pid. The reference must be
released via pidfs_put_pid().

This will allow AF_UNIX sockets to allocate a dentry for the peer
credentials pid at the time they are recorded where we know the task is
still alive. When the task gets reaped its exit status is guaranteed to
be recorded and a pidfd can be handed out for the reaped task.

Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-1-450a19461e75@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: David Rheinsberg <david@readahead.eu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-26 08:27:33 +02:00
Christian Brauner
19bbfe7b5f
fs: add S_ANON_INODE
This makes it easy to detect proper anonymous inodes and to ensure that
we can detect them in codepaths such as readahead().

Readahead on anonymous inodes didn't work because they didn't have a
proper mode. Now that they have we need to retain EINVAL being returned
otherwise LTP will fail.

We also need to ensure that ioctls aren't simply fired like they are for
regular files so things like inotify inodes continue to correctly call
their own ioctl handlers as in [1].

Reported-by: Xilin Wu <sophon@radxa.com>
Link: https://lore.kernel.org/3A9139D5CD543962+89831381-31b9-4392-87ec-a84a5b3507d8@radxa.com [1]
Link: https://lore.kernel.org/7a1a7076-ff6b-4cb0-94e7-7218a0a44028@sirena.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 13:20:14 +02:00
Christian Brauner
c83b902496
pidfs: use anon_inode_setattr()
So far pidfs did use it's own version. Just use the generic version.
We use our own wrappers because we're going to be implementing
properties soon.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-4-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:19:02 +02:00
Christian Brauner
37e62dafbf
pidfs: use anon_inode_getattr()
So far pidfs did use it's own version. Just use the generic version. We
use our own wrappers because we're going to be implementing our own
retrieval properties soon.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-2-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:18:56 +02:00
Linus Torvalds
df00ded23a vfs-6.15-rc1.pidfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90pqgAKCRCRxhvAZXjc
 oqVsAP9Aq/fMCI14HeXehPCezKQZPu1HTrPPo2clLHXoSnafawEAsA3YfWTT4Heb
 iexzqvAEUOMYOVN66QEc+6AAwtMLrwc=
 =0eYo
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pidfs updates from Christian Brauner:

 - Allow retrieving exit information after a process has been reaped
   through pidfds via the new PIDFD_INTO_EXIT extension for the
   PIDFD_GET_INFO ioctl. Various tools need access to information about
   a process/task even after it has already been reaped.

   Pidfd polling allows waiting on either task exit or for a task to
   have been reaped. The contract for PIDFD_INFO_EXIT is simply that
   EPOLLHUP must be observed before exit information can be retrieved,
   i.e., exit information is only provided once the task has been reaped
   and then can be retrieved as long as the pidfd is open.

 - Add PIDFD_SELF_{THREAD,THREAD_GROUP} sentinels allowing userspace to
   forgo allocating a file descriptor for their own process. This is
   useful in scenarios where users want to act on their own process
   through pidfds and is akin to AT_FDCWD.

 - Improve premature thread-group leader and subthread exec behavior
   when polling on pidfds:

   (1) During a multi-threaded exec by a subthread, i.e.,
       non-thread-group leader thread, all other threads in the
       thread-group including the thread-group leader are killed and the
       struct pid of the thread-group leader will be taken over by the
       subthread that called exec. IOW, two tasks change their TIDs.

   (2) A premature thread-group leader exit means that the thread-group
       leader exited before all of the other subthreads in the
       thread-group have exited.

   Both cases lead to inconsistencies for pidfd polling with
   PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the
   current thread-group leader may or may not see an exit notification
   on the file descriptor depending on when poll is performed. If the
   poll is performed before the exec of the subthread has concluded an
   exit notification is generated for the old thread-group leader. If
   the poll is performed after the exec of the subthread has concluded
   no exit notification is generated for the old thread-group leader.

   The correct behavior is to simply not generate an exit notification
   on the struct pid of a subhthread exec because the struct pid is
   taken over by the subthread and thus remains alive.

   But this is difficult to handle because a thread-group may exit
   premature as mentioned in (2). In that case an exit notification is
   reliably generated but the subthreads may continue to run for an
   indeterminate amount of time and thus also may exec at some point.

   After this pull no exit notifications will be generated for a
   PIDFD_THREAD pidfd for a thread-group leader until all subthreads
   have been reaped. If a subthread should exec before no exit
   notification will be generated until that task exits or it creates
   subthreads and repeates the cycle.

   This means an exit notification indicates the ability for the father
   to reap the child.

* tag 'vfs-6.15-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  selftests/pidfd: third test for multi-threaded exec polling
  selftests/pidfd: second test for multi-threaded exec polling
  selftests/pidfd: first test for multi-threaded exec polling
  pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  pidfs: ensure that PIDFS_INFO_EXIT is available
  selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest
  selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest
  selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest
  selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest
  selftests/pidfd: add third PIDFD_INFO_EXIT selftest
  selftests/pidfd: add second PIDFD_INFO_EXIT selftest
  selftests/pidfd: add first PIDFD_INFO_EXIT selftest
  selftests/pidfd: expand common pidfd header
  pidfs/selftests: ensure correct headers for ioctl handling
  selftests/pidfd: fix header inclusion
  pidfs: allow to retrieve exit information
  pidfs: record exit code and cgroupid at exit
  pidfs: use private inode slab cache
  pidfs: move setting flags into pidfs_alloc_file()
  pidfd: rely on automatic cleanup in __pidfd_prepare()
  ...
2025-03-24 10:16:37 -07:00
Christian Brauner
0fb482728b
pidfs: improve multi-threaded exec and premature thread-group leader exit polling
This is another attempt trying to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
    leader thread, all other threads in the thread-group including the
    thread-group leader are killed and the struct pid of the
    thread-group leader will be taken over by the subthread that called
    exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
    leader exited before all of the other subthreads in the thread-group
    have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification on the struct pid of a subhthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated but the subthreads may continue to run for an
indeterminate amount of time and thus also may exec at some point.

So far there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding
PIDFD_THREAD notification on premature thread-group leader exit.

If that works correctly then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec aftewards no exit notification
will be generated until that task exits or it creates subthreads and
repeates the cycle.

Co-Developed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v4-1-da678ce805bf@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20 15:32:43 +01:00
Christian Brauner
68db272741
pidfs: ensure that PIDFS_INFO_EXIT is available
When we currently create a pidfd we check that the task hasn't been
reaped right before we create the pidfd. But it is of course possible
that by the time we return the pidfd to userspace the task has already
been reaped since we don't check again after having created a dentry for
it.

This was fine until now because that race was meaningless. But now that
we provide PIDFD_INFO_EXIT it is a problem because it is possible that
the kernel returns a reaped pidfd and it depends on the race whether
PIDFD_INFO_EXIT information is available. This depends on if the task
gets reaped before or after a dentry has been attached to struct pid.

Make this consistent and only returned pidfds for reaped tasks if
PIDFD_INFO_EXIT information is available. This is done by performing
another check whether the task has been reaped right after we attached a
dentry to struct pid.

Since pidfs_exit() is called before struct pid's task linkage is removed
the case where the task got reaped but a dentry was already attached to
struct pid and exit information was recorded and published can be
handled correctly. In that case we do return a pidfd for a reaped task
like we would've before.

Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19 14:40:18 +01:00
Christian Brauner
7477d7dce4
pidfs: allow to retrieve exit information
Some tools like systemd's jounral need to retrieve the exit and cgroup
information after a process has already been reaped. This can e.g.,
happen when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-6-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 13:26:17 +01:00
Christian Brauner
4513522984
pidfs: record exit code and cgroupid at exit
Record the exit code and cgroupid in release_task() and stash in struct
pidfs_exit_info so it can be retrieved even after the task has been
reaped.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 13:26:12 +01:00
Christian Brauner
0b4200381e
pidfs: use private inode slab cache
Introduce a private inode slab cache for pidfs. In follow-up patches
pidfs will gain the ability to provide exit information to userspace
after the task has been reaped. This means storing exit information even
after the task has already been released and struct pid's task linkage
is gone. Store that information alongside the inode.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-4-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 13:26:09 +01:00
Christian Brauner
3155a19407
pidfs: move setting flags into pidfs_alloc_file()
Instead od adding it into __pidfd_prepare() place it where the actual
file allocation happens and update the outdated comment.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-3-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 13:26:05 +01:00
Christian Brauner
816b2e6020
pidfs: switch to copy_struct_to_user()
We have a helper that deals with all the required logic.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-1-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 13:24:52 +01:00
Christian Brauner
02cfe2b652
pidfs: remove d_op->d_delete
Pidfs only deals with unhashed dentries and there's currently no way for
them to become hashed. So remove d_op->d_delete.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24 11:27:02 +01:00
Christian Brauner
091ee63e36
pidfs: improve ioctl handling
Pidfs supports extensible and non-extensible ioctls. The extensible
ioctls need to check for the ioctl number itself not just the ioctl
command otherwise both backward- and forward compatibility are broken.

The pidfs ioctl handler also needs to look at the type of the ioctl
command to guard against cases where "[...] a daemon receives some
random file descriptor from a (potentially less privileged) client and
expects the FD to be of some specific type, it might call ioctl() on
this FD with some type-specific command and expect the call to fail if
the FD is of the wrong type; but due to the missing type check, the
kernel instead performs some action that userspace didn't expect."
(cf. [1]]

Link: https://lore.kernel.org/r/20250204-work-pidfs-ioctl-v1-1-04987d239575@kernel.org
Link: https://lore.kernel.org/r/CAG48ez2K9A5GwtgqO31u9ZL292we8ZwAA=TJwwEv7wRuJ3j4Lw@mail.gmail.com [1]
Fixes: 8ce3528188 ("pidfs: check for valid ioctl commands")
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: stable@vger.kernel.org # v6.13; please backport with 8ce3528188 ("pidfs: check for valid ioctl commands")
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07 10:27:28 +01:00
Christian Brauner
ef4144ac2d
pidfs: allow bind-mounts
Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts for
pidfds. This allows pidfds to be safely recovered and checked for
process recycling.

Link: https://lore.kernel.org/r/20241219-work-pidfs-mount-v1-1-dbc56198b839@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-22 11:03:10 +01:00
Christian Brauner
16ecd47cb0
pidfs: lookup pid through rbtree
The new pid inode number allocation scheme is neat but I overlooked a
possible, even though unlikely, attack that can be used to trigger an
overflow on both 32bit and 64bit.

An unique 64 bit identifier was constructed for each struct pid by two
combining a 32 bit idr with a 32 bit generation number. A 32bit number
was allocated using the idr_alloc_cyclic() infrastructure. When the idr
wrapped around a 32 bit wraparound counter was incremented. The 32 bit
wraparound counter served as the upper 32 bits and the allocated idr
number as the lower 32 bits.

Since the idr can only allocate up to INT_MAX entries everytime a
wraparound happens INT_MAX - 1 entries are lost (Ignoring that numbering
always starts at 2 to avoid theoretical collisions with the root inode
number.).

If userspace fully populates the idr such that and puts itself into
control of two entries such that one entry is somewhere in the middle
and the other entry is the INT_MAX entry then it is possible to overflow
the wraparound counter. That is probably difficult to pull off but the
mere possibility is annoying.

The problem could be contained to 32 bit by switching to a data
structure such as the maple tree that allows allocating 64 bit numbers
on 64 bit machines. That would leave 32 bit in a lurch but that probably
doesn't matter that much. The other problem is that removing entries
form the maple tree is somewhat non-trivial because the removal code can
be called under the irq write lock of tasklist_lock and
irq{save,restore} code.

Instead, allocate unique identifiers for struct pid by simply
incrementing a 64 bit counter and insert each struct pid into the rbtree
so it can be looked up to decode file handles avoiding to leak actual
pids across pid namespaces in file handles.

On both 64 bit and 32 bit the same 64 bit identifier is used to lookup
struct pid in the rbtree. On 64 bit the unique identifier for struct pid
simply becomes the inode number. Comparing two pidfds continues to be as
simple as comparing inode numbers.

On 32 bit the 64 bit number assigned to struct pid is split into two 32
bit numbers. The lower 32 bits are used as the inode number and the
upper 32 bits are used as the inode generation number. Whenever a
wraparound happens on 32 bit the 64 bit number will be incremented by 2
so inode numbering starts at 2 again.

When a wraparound happens on 32 bit multiple pidfds with the same inode
number are likely to exist. This isn't a problem since before pidfs
pidfds used the anonymous inode meaning all pidfds had the same inode
number. On 32 bit sserspace can thus reconstruct the 64 bit identifier
by retrieving both the inode number and the inode generation number to
compare, or use file handles. This gives the same guarantees on both 32
bit and 64 bit.

Link: https://lore.kernel.org/r/20241214-gekoppelt-erdarbeiten-a1f9a982a5a6@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-17 09:16:18 +01:00
Christian Brauner
8ce3528188
pidfs: check for valid ioctl commands
Prior to doing any work, check whether the provided ioctl command is
supported by pidfs.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-17 09:16:18 +01:00
Christian Brauner
b3caba8f7a
pidfs: implement file handle support
On 64-bit platforms, userspace can read the pidfd's inode in order to
get a never-repeated PID identifier. On 32-bit platforms this identifier
is not exposed, as inodes are limited to 32 bits. Instead expose the
identifier via export_fh, which makes it available to userspace via
name_to_handle_at.

In addition we implement fh_to_dentry, which allows userspace to
recover a pidfd from a pidfs file handle.

Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
[brauner: patch heavily rewritten]
Link: https://lore.kernel.org/r/20241129-work-pidfs-file_handle-v1-6-87d803a42495@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Co-Developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-17 09:16:17 +01:00
Christian Brauner
230536ff6b
pidfs: support FS_IOC_GETVERSION
This will allow 32 bit userspace to detect when a given inode number has
been recycled and also to construct a unique 64 bit identifier.

Link: https://lore.kernel.org/r/20241129-work-pidfs-v2-3-61043d66fbce@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-14 12:40:40 +01:00
Christian Brauner
03c212bf3f
pidfs: remove 32bit inode number handling
Now that we have a unified inode number handling model remove the custom
ida-based allocation for 32bit.

Link: https://lore.kernel.org/r/20241129-work-pidfs-v2-2-61043d66fbce@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-14 12:40:40 +01:00
Christian Brauner
9698d5a483
pidfs: rework inode number allocation
Recently we received a patchset that aims to enable file handle encoding
and decoding via name_to_handle_at(2) and open_by_handle_at(2).

A crucical step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts. The
issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2). This can be used by containers using a separate
pid namespace to learn what the pid number of a given process in the
initial pid namespace is. While this is a weak information leak it could
be used in various exploits and in general is an ugly wart in the design.

To solve this problem a new way is needed to lookup a struct pid based
on the inode number allocated for that struct pid. The other part is to
remove the custom inode number allocation on 32bit systems that is also
an ugly wart that should go away.

So, a new scheme is used that I was discusssing with Tejun some time
back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
are used for the generation number. This gives a 64 bit inode number
that is unique on both 32 bit and 64 bit. The lower 32 bit number is
recycled slowly and can be used to lookup struct pids.

Link: https://lore.kernel.org/r/20241129-work-pidfs-v2-1-61043d66fbce@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-14 12:40:31 +01:00
Luca Boccassi
cdda1f26e7
pidfd: add ioctl to retrieve pid info
A common pattern when using pid fds is having to get information
about the process, which currently requires /proc being mounted,
resolving the fd to a pid, and then do manual string parsing of
/proc/N/status and friends. This needs to be reimplemented over
and over in all userspace projects (e.g.: I have reimplemented
resolving in systemd, dbus, dbus-daemon, polkit so far), and
requires additional care in checking that the fd is still valid
after having parsed the data, to avoid races.

Having a programmatic API that can be used directly removes all
these requirements, including having /proc mounted.

As discussed at LPC24, add an ioctl with an extensible struct
so that more parameters can be added later if needed. Start with
returning pid/tgid/ppid and creds unconditionally, and cgroupid
optionally.

Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
Link: https://lore.kernel.org/r/20241010155401.2268522-1-luca.boccassi@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-24 13:54:51 +02:00
Christian Brauner
8a46067783
pidfs: check for valid pid namespace
When we access a no-current task's pid namespace we need check that the
task hasn't been reaped in the meantime and it's pid namespace isn't
accessible anymore.

The user namespace is fine because it is only released when the last
reference to struct task_struct is put and exit_creds() is called.

Link: https://lore.kernel.org/r/20240926-klebt-altgedienten-0415ad4d273c@brauner
Fixes: 5b08bd4085 ("pidfs: allow retrieval of namespace file descriptors")
CC: stable@vger.kernel.org # v6.11
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-27 18:29:19 +02:00
Christian Brauner
9b3e150464
pidfs: handle kernels without namespaces cleanly
The nsproxy structure contains nearly all of the namespaces associated
with a task. When a given namespace type is not supported by this kernel
the rules whether the corresponding pointer in struct nsproxy is NULL or
always init_<ns_type>_ns differ per namespace. Ideally, that wouldn't be
the case and for all namespace types we'd always set it to
init_<ns_type>_ns when the corresponding namespace type isn't supported.

Make sure we handle all namespaces where the pointer in struct nsproxy
can be NULL when the namespace type isn't supported.

Link: https://lore.kernel.org/r/20240722-work-pidfs-e6a83030f63e@brauner
Fixes: 5b08bd4085 ("pidfs: allow retrieval of namespace file descriptors") # mainline only
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-24 10:53:13 +02:00
Edward Adam Davis
f60d38cb02
pidfs: when time ns disabled add check for ioctl
syzbot call pidfd_ioctl() with cmd "PIDFD_GET_TIME_NAMESPACE" and disabled
CONFIG_TIME_NS, since time_ns is NULL, it will make NULL ponter deref in
open_namespace.

Fixes: 5b08bd4085 ("pidfs: allow retrieval of namespace file descriptors") # mainline only
Reported-and-tested-by: syzbot+34a0ee986f61f15da35d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=34a0ee986f61f15da35d
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Link: https://lore.kernel.org/r/tencent_7FAE8DB725EE0DD69236DDABDDDE195E4F07@qq.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-24 10:53:12 +02:00
Christian Brauner
5b08bd4085
pidfs: allow retrieval of namespace file descriptors
For users that hold a reference to a pidfd procfs might not even be
available nor is it desirable to parse through procfs just for the sake
of getting namespace file descriptors for a process.

Make it possible to directly retrieve namespace file descriptors from a
pidfd. Pidfds already can be used with setns() to change a set of
namespaces atomically.

Link: https://lore.kernel.org/r/20240627-work-pidfs-v1-4-7e9ab6cc3bb1@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-28 10:37:29 +02:00
Linus Torvalds
db3d841ac9 fs/pidfs: make 'lsof' happy with our inode changes
pidfs started using much saner inodes in commit b28ddcc32d ("pidfs:
convert to path_from_stashed() helper"), but that exposed the fact that
lsof had some knowledge of just how odd our old anon_inode usage was.

For example, legacy anon_inodes hadn't even initialized the inode type
in the inode mode, so everything had a type of zero.

So sane tools like 'stat' would report these files as "weird file", but
'lsof' instead used that (together with the name of the link in proc) to
notice that it's an anonymous inode, and used it to detect pidfd files.

Let's keep our internal new sane inode model, but mask the file type
bits at 'stat()' time in the getattr() function we already have, and by
making the dentry name match what lsof expects too.

This keeps our internal models sane, but should make user space see the
same old odd behavior.

Reported-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/all/a15b1050-4b52-4740-a122-a4d055c17f11@kernel.org/
Link: https://github.com/lsof-org/lsof/issues/317
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-21 08:08:00 -07:00
Christian Brauner
9d9539db86 pidfs: remove config option
As Linus suggested this enables pidfs unconditionally. A key property to
retain is the ability to compare pidfds by inode number (cf. [1]).
That's extremely helpful just as comparing namespace file descriptors by
inode number is. They are used in a variety of scenarios where they need
to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a
socket to trivially authenticate a the sender and various other
use-cases.

For 64bit systems this is pretty trivial to do. For 32bit it's slightly
more annoying as we discussed but we simply add a dumb ida based
allocator that gets used on 32bit. This gives the same guarantees about
inode numbers on 64bit without any overflow risk. Practically, we'll
never run into overflow issues because we're constrained by the number
of processes that can exist on 32bit and by the number of open files
that can exist on a 32bit system. On 64bit none of this matters and
things are very simple.

If 32bit also needs the uniqueness guarantee they can simply parse the
contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a
variety of use-cases. One of the most obvious ones is that they will
make pidfiles (or "pidfdfiles", I guess) reliable as the unique
identifier can be placed into there that won't be reycled. Also a
frequent request.

Note, I took the chance and simplified path_from_stashed() even further.
Instead of passing the inode number explicitly to path_from_stashed() we
let the filesystem handle that internally. So path_from_stashed() ends
up even simpler than it is now. This is also a good solution allowing
the cleanup code to be clean and consistent between 32bit and 64bit. The
cleanup path in prepare_anon_dentry() is also switched around so we put
the inode before the dentry allocation. This means we only have to call
the cleanup handler for the filesystem's inode data once and can rely
->evict_inode() otherwise.

Aside from having to have a bit of extra code for 32bit it actually ends
up a nice cleanup for path_from_stashed() imho.

Tested on both 32 and 64bit including error injection.

Link: https://github.com/systemd/systemd/pull/31713 [1]
Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-13 12:53:53 -07:00
Christian Brauner
e9c5263ce1
libfs: improve path_from_stashed()
Right now we pass a bunch of info that is fs specific which doesn't make
a lot of sense and it bleeds fs sepcific details into the generic
helper. nsfs and pidfs have slightly different needs when initializing
inodes. Add simple operations that are stashed in sb->s_fs_info that
both can implement. This also allows us to get rid of cleaning up
references in the caller. All in all path_from_stashed() becomes way
simpler.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01 22:31:40 +01:00
Christian Brauner
2558e3b231
libfs: add stashed_dentry_prune()
Both pidfs and nsfs use a memory location to stash a dentry for reuse by
concurrent openers. Right now two custom
dentry->d_prune::{ns,pidfs}_prune_dentry() methods are needed that do
the same thing. The only thing that differs is that they need to get to
the memory location to store or retrieve the dentry from differently.
Fix that by remember the stashing location for the dentry in
dentry->d_fsdata which allows us to retrieve it in dentry->d_prune. That
in turn makes it possible to add a common helper that pidfs and nsfs can
both use.

Link: https://lore.kernel.org/r/CAHk-=wg8cHY=i3m6RnXQ2Y2W8psicKWQEZq1=94ivUiviM-0OA@mail.gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01 12:26:29 +01:00
Christian Brauner
159a0d9fd5
libfs: improve path_from_stashed() helper
In earlier patches we moved both nsfs and pidfs to path_from_stashed().
The helper currently tries to add and stash a new dentry if a reusable
dentry couldn't be found and returns EAGAIN if it lost the race to stash
the dentry. The caller can use EAGAIN to retry.

The helper and the two filesystems be written in a way that makes
returning EAGAIN unnecessary. To do this we need to change the
dentry->d_prune() implementation of nsfs and pidfs to not simply replace
the stashed dentry with NULL but to use a cmpxchg() and only replace
their own dentry.

Then path_from_stashed() can then be changed to not just stash a new
dentry when no dentry is currently stashed but also when an already dead
dentry is stashed. If another task managed to install a dentry in the
meantime it can simply be reused. Pack that into a loop and call it a
day.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wgtLF5Z5=15-LKAczWm=-tUjHO+Bpf7WjBG+UU3s=fEQw@mail.gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01 12:26:23 +01:00