linux/include/drm/drm_device.h
Raag Jadav b7cf9f4ac1
drm: Introduce device wedged event
Introduce device wedged event, which notifies userspace of 'wedged'
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected and has become unrecoverable from driver context. Purpose of
this implementation is to provide drivers a generic way to recover the
device with the help of userspace intervention without taking any drastic
measures (like resetting or re-enumerating the full bus, on which the
underlying physical device is sitting) in the driver.

A 'wedged' device is basically a device that is declared dead by the
driver after exhausting all possible attempts to recover it from driver
context. The uevent is the notification that is sent to userspace along
with a hint about what could possibly be attempted to recover the device
from userspace and bring it back to usable state. Different drivers may
have different ideas of a 'wedged' device depending on hardware
implementation of the underlying physical device, and hence the vendor
agnostic nature of the event. It is up to the drivers to decide when they
see the need for device recovery and how they want to recover from the
available methods.

Driver prerequisites
--------------------

The driver, before opting for recovery, needs to make sure that the
'wedged' device doesn't harm the system as a whole by taking care of the
prerequisites. Necessary actions must include disabling DMA to system
memory as well as any communication channels with other devices. Further,
the driver must ensure that all dma_fences are signalled and any device
state that the core kernel might depend on is cleaned up. All existing
mmaps should be invalidated and page faults should be redirected to a
dummy page. Once the event is sent, the device must be kept in 'wedged'
state until the recovery is performed. New accesses to the device
(IOCTLs) should be rejected, preferably with an error code that resembles
the type of failure the device has encountered. This will signify the
reason for wedging, which can be reported to the application if needed.

Recovery
--------

Current implementation defines three recovery methods, out of which,
drivers can use any one, multiple or none. Method(s) of choice will be
sent in the uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in
order of less to more side-effects. If driver is unsure about recovery
or method is unknown (like soft/hard system reboot, firmware flashing,
physical device replacement or any other procedure which can't be
attempted on the fly), ``WEDGED=unknown`` will be sent instead.

Userspace consumers can parse this event and attempt recovery as per the
following expectations.

    =============== ========================================
    Recovery method Consumer expectations
    =============== ========================================
    none            optional telemetry collection
    rebind          unbind + bind driver
    bus-reset       unbind + bus reset/re-enumeration + bind
    unknown         consumer policy
    =============== ========================================

The only exception to this is ``WEDGED=none``, which signifies that the
device was temporarily 'wedged' at some point but was recovered from driver
context using device specific methods like reset. No explicit recovery is
expected from the consumer in this case, but it can still take additional
steps like gathering telemetry information (devcoredump, syslog). This is
useful because the first hang is usually the most critical one which can
result in consequential hangs or complete wedging.

Consumer prerequisites
----------------------

It is the responsibility of the consumer to make sure that the device or
its resources are not in use by any process before attempting recovery.
With IOCTLs erroring out, all device memory should be unmapped and file
descriptors should be closed to prevent leaks or undefined behaviour. The
idea here is to clear the device of all user context beforehand and set
the stage for a clean recovery.

Example
-------

Udev rule::

    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
    RUN+="/path/to/rebind.sh $env{DEVPATH}"

Recovery script::

    #!/bin/sh

    DEVPATH=$(readlink -f /sys/$1/device)
    DEVICE=$(basename $DEVPATH)
    DRIVER=$(readlink -f $DEVPATH/driver)

    echo -n $DEVICE > $DRIVER/unbind
    echo -n $DEVICE > $DRIVER/bind

Customization
-------------

Although basic recovery is possible with a simple script, consumers can
define custom policies around recovery. For example, if the driver supports
multiple recovery methods, consumers can opt for the suitable one depending
on scenarios like repeat offences or vendor specific failures. Consumers
can also choose to have the device available for debugging or telemetry
collection and base their recovery decision on the findings. This is useful
especially when the driver is unsure about recovery or method is unknown.

 v4: s/drm_dev_wedged/drm_dev_wedged_event
     Use drm_info() (Jani)
     Kernel doc adjustment (Aravind)
 v5: Send recovery method with uevent (Lina)
 v6: Access wedge_recovery_opts[] using helper function (Jani)
     Use snprintf() (Jani)
 v7: Convert recovery helpers into regular functions (Andy, Jani)
     Aesthetic adjustments (Andy)
     Handle invalid recovery method
 v8: Allow sending multiple methods with uevent (Lucas, Michal)
     static_assert() globally (Andy)
 v9: Provide 'none' method for device reset (Christian)
     Provide recovery opts using switch cases
v11: Log device reset (André)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250204070528.1919158-2-raag.jadav@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-13 12:15:43 -05:00

330 lines
8.1 KiB
C

#ifndef _DRM_DEVICE_H_
#define _DRM_DEVICE_H_
#include <linux/list.h>
#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/idr.h>
#include <drm/drm_mode_config.h>
struct drm_driver;
struct drm_minor;
struct drm_master;
struct drm_vblank_crtc;
struct drm_vma_offset_manager;
struct drm_vram_mm;
struct drm_fb_helper;
struct inode;
struct pci_dev;
struct pci_controller;
/*
* Recovery methods for wedged device in order of less to more side-effects.
* To be used with drm_dev_wedged_event() as recovery @method. Callers can
* use any one, multiple (or'd) or none depending on their needs.
*/
#define DRM_WEDGE_RECOVERY_NONE BIT(0) /* optional telemetry collection */
#define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver */
#define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus device + bind */
/**
* enum switch_power_state - power state of drm device
*/
enum switch_power_state {
/** @DRM_SWITCH_POWER_ON: Power state is ON */
DRM_SWITCH_POWER_ON = 0,
/** @DRM_SWITCH_POWER_OFF: Power state is OFF */
DRM_SWITCH_POWER_OFF = 1,
/** @DRM_SWITCH_POWER_CHANGING: Power state is changing */
DRM_SWITCH_POWER_CHANGING = 2,
/** @DRM_SWITCH_POWER_DYNAMIC_OFF: Suspended */
DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
};
/**
* struct drm_device - DRM device structure
*
* This structure represent a complete card that
* may contain multiple heads.
*/
struct drm_device {
/** @if_version: Highest interface version set */
int if_version;
/** @ref: Object ref-count */
struct kref ref;
/** @dev: Device structure of bus-device */
struct device *dev;
/**
* @managed:
*
* Managed resources linked to the lifetime of this &drm_device as
* tracked by @ref.
*/
struct {
/** @managed.resources: managed resources list */
struct list_head resources;
/** @managed.final_kfree: pointer for final kfree() call */
void *final_kfree;
/** @managed.lock: protects @managed.resources */
spinlock_t lock;
} managed;
/** @driver: DRM driver managing the device */
const struct drm_driver *driver;
/**
* @dev_private:
*
* DRM driver private data. This is deprecated and should be left set to
* NULL.
*
* Instead of using this pointer it is recommended that drivers use
* devm_drm_dev_alloc() and embed struct &drm_device in their larger
* per-device structure.
*/
void *dev_private;
/**
* @primary:
*
* Primary node. Drivers should not interact with this
* directly. debugfs interfaces can be registered with
* drm_debugfs_add_file(), and sysfs should be directly added on the
* hardware (and not character device node) struct device @dev.
*/
struct drm_minor *primary;
/**
* @render:
*
* Render node. Drivers should not interact with this directly ever.
* Drivers should not expose any additional interfaces in debugfs or
* sysfs on this node.
*/
struct drm_minor *render;
/** @accel: Compute Acceleration node */
struct drm_minor *accel;
/**
* @registered:
*
* Internally used by drm_dev_register() and drm_connector_register().
*/
bool registered;
/**
* @master:
*
* Currently active master for this device.
* Protected by &master_mutex
*/
struct drm_master *master;
/**
* @driver_features: per-device driver features
*
* Drivers can clear specific flags here to disallow
* certain features on a per-device basis while still
* sharing a single &struct drm_driver instance across
* all devices.
*/
u32 driver_features;
/**
* @unplugged:
*
* Flag to tell if the device has been unplugged.
* See drm_dev_enter() and drm_dev_is_unplugged().
*/
bool unplugged;
/** @anon_inode: inode for private address-space */
struct inode *anon_inode;
/** @unique: Unique name of the device */
char *unique;
/**
* @struct_mutex:
*
* Lock for others (not &drm_minor.master and &drm_file.is_master)
*
* TODO: This lock used to be the BKL of the DRM subsystem. Move the
* lock into i915, which is the only remaining user.
*/
struct mutex struct_mutex;
/**
* @master_mutex:
*
* Lock for &drm_minor.master and &drm_file.is_master
*/
struct mutex master_mutex;
/**
* @open_count:
*
* Usage counter for outstanding files open,
* protected by drm_global_mutex
*/
atomic_t open_count;
/** @filelist_mutex: Protects @filelist. */
struct mutex filelist_mutex;
/**
* @filelist:
*
* List of userspace clients, linked through &drm_file.lhead.
*/
struct list_head filelist;
/**
* @filelist_internal:
*
* List of open DRM files for in-kernel clients.
* Protected by &filelist_mutex.
*/
struct list_head filelist_internal;
/**
* @clientlist_mutex:
*
* Protects &clientlist access.
*/
struct mutex clientlist_mutex;
/**
* @clientlist:
*
* List of in-kernel clients. Protected by &clientlist_mutex.
*/
struct list_head clientlist;
/**
* @vblank_disable_immediate:
*
* If true, vblank interrupt will be disabled immediately when the
* refcount drops to zero, as opposed to via the vblank disable
* timer.
*
* This can be set to true it the hardware has a working vblank counter
* with high-precision timestamping (otherwise there are races) and the
* driver uses drm_crtc_vblank_on() and drm_crtc_vblank_off()
* appropriately. Also, see @max_vblank_count,
* &drm_crtc_funcs.get_vblank_counter and
* &drm_vblank_crtc_config.disable_immediate.
*/
bool vblank_disable_immediate;
/**
* @vblank:
*
* Array of vblank tracking structures, one per &struct drm_crtc. For
* historical reasons (vblank support predates kernel modesetting) this
* is free-standing and not part of &struct drm_crtc itself. It must be
* initialized explicitly by calling drm_vblank_init().
*/
struct drm_vblank_crtc *vblank;
/**
* @vblank_time_lock:
*
* Protects vblank count and time updates during vblank enable/disable
*/
spinlock_t vblank_time_lock;
/**
* @vbl_lock: Top-level vblank references lock, wraps the low-level
* @vblank_time_lock.
*/
spinlock_t vbl_lock;
/**
* @max_vblank_count:
*
* Maximum value of the vblank registers. This value +1 will result in a
* wrap-around of the vblank register. It is used by the vblank core to
* handle wrap-arounds.
*
* If set to zero the vblank core will try to guess the elapsed vblanks
* between times when the vblank interrupt is disabled through
* high-precision timestamps. That approach is suffering from small
* races and imprecision over longer time periods, hence exposing a
* hardware vblank counter is always recommended.
*
* This is the statically configured device wide maximum. The driver
* can instead choose to use a runtime configurable per-crtc value
* &drm_vblank_crtc.max_vblank_count, in which case @max_vblank_count
* must be left at zero. See drm_crtc_set_max_vblank_count() on how
* to use the per-crtc value.
*
* If non-zero, &drm_crtc_funcs.get_vblank_counter must be set.
*/
u32 max_vblank_count;
/** @vblank_event_list: List of vblank events */
struct list_head vblank_event_list;
/**
* @event_lock:
*
* Protects @vblank_event_list and event delivery in
* general. See drm_send_event() and drm_send_event_locked().
*/
spinlock_t event_lock;
/** @num_crtcs: Number of CRTCs on this device */
unsigned int num_crtcs;
/** @mode_config: Current mode config */
struct drm_mode_config mode_config;
/** @object_name_lock: GEM information */
struct mutex object_name_lock;
/** @object_name_idr: GEM information */
struct idr object_name_idr;
/** @vma_offset_manager: GEM information */
struct drm_vma_offset_manager *vma_offset_manager;
/** @vram_mm: VRAM MM memory manager */
struct drm_vram_mm *vram_mm;
/**
* @switch_power_state:
*
* Power state of the client.
* Used by drivers supporting the switcheroo driver.
* The state is maintained in the
* &vga_switcheroo_client_ops.set_gpu_state callback
*/
enum switch_power_state switch_power_state;
/**
* @fb_helper:
*
* Pointer to the fbdev emulation structure.
* Set by drm_fb_helper_init() and cleared by drm_fb_helper_fini().
*/
struct drm_fb_helper *fb_helper;
/**
* @debugfs_root:
*
* Root directory for debugfs files.
*/
struct dentry *debugfs_root;
};
#endif