linux/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
Jon Pan-Doh b4fe7398de PCI/AER: Add sysfs attributes for log ratelimits
Allow userspace to read/write log ratelimits per device (including
enable/disable). Create aer/ sysfs directory to store them and any
future AER configs.

The new sysfs files are:

  /sys/bus/pci/devices/*/aer/correctable_ratelimit_burst
  /sys/bus/pci/devices/*/aer/correctable_ratelimit_interval_ms
  /sys/bus/pci/devices/*/aer/nonfatal_ratelimit_burst
  /sys/bus/pci/devices/*/aer/nonfatal_ratelimit_interval_ms

The default values are ratelimit_burst=10, ratelimit_interval_ms=5000, so
if we try to emit more than 10 messages in a 5 second period, some are
suppressed.

Update AER sysfs ABI filename to reflect the broader scope of AER sysfs
attributes (e.g. stats and ratelimits).

  Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats ->
    sysfs-bus-pci-devices-aer

Tested using aer-inject[1]. Configured correctable log ratelimit to 5.
Sent 6 AER errors. Observed 5 errors logged while AER stats
(cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.

Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors
logged and accounted in AER stats (12 total errors).

[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git

[bhelgaas: note fatal errors are not ratelimited, "aer_report" ->
"aer_info", replace ratelimit_log_enable toggle with *_ratelimit_interval_ms]

Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-21-helgaas@kernel.org
2025-05-23 11:11:45 -05:00

163 lines
6 KiB
Text

PCIe Device AER statistics
--------------------------
These attributes show up under all the devices that are AER capable. These
statistical counters indicate the errors "as seen/reported by the device".
Note that this may mean that if an endpoint is causing problems, the AER
counters may increment at its link partner (e.g. root port) because the
errors may be "seen" / reported by the link partner and not the
problematic endpoint itself (which may report all counters as 0 as it never
saw any problems).
What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of correctable errors seen and reported by this
PCI device using ERR_COR. Note that since multiple errors may
be reported using a single ERR_COR message, thus
TOTAL_ERR_COR at the end of the file may not match the actual
total of all the errors in the file. Sample output::
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
Receiver Error 2
Bad TLP 0
Bad DLLP 0
RELAY_NUM Rollover 0
Replay Timer Timeout 0
Advisory Non-Fatal 0
Corrected Internal Error 0
Header Log Overflow 0
TOTAL_ERR_COR 2
What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable fatal errors seen and reported by this
PCI device using ERR_FATAL. Note that since multiple errors may
be reported using a single ERR_FATAL message, thus
TOTAL_ERR_FATAL at the end of the file may not match the actual
total of all the errors in the file. Sample output::
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
Poisoned TLP 0
Flow Control Protocol 0
Completion Timeout 0
Completer Abort 0
Unexpected Completion 0
Receiver Overflow 0
Malformed TLP 0
ECRC 0
Unsupported Request 0
ACS Violation 0
Uncorrectable Internal Error 0
MC Blocked TLP 0
AtomicOp Egress Blocked 0
TLP Prefix Blocked Error 0
TOTAL_ERR_FATAL 0
What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable nonfatal errors seen and reported by this
PCI device using ERR_NONFATAL. Note that since multiple errors
may be reported using a single ERR_FATAL message, thus
TOTAL_ERR_NONFATAL at the end of the file may not match the
actual total of all the errors in the file. Sample output::
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
Poisoned TLP 0
Flow Control Protocol 0
Completion Timeout 0
Completer Abort 0
Unexpected Completion 0
Receiver Overflow 0
Malformed TLP 0
ECRC 0
Unsupported Request 0
ACS Violation 0
Uncorrectable Internal Error 0
MC Blocked TLP 0
AtomicOp Egress Blocked 0
TLP Prefix Blocked Error 0
TOTAL_ERR_NONFATAL 0
PCIe Rootport AER statistics
----------------------------
These attributes show up under only the rootports (or root complex event
collectors) that are AER capable. These indicate the number of error messages as
"reported to" the rootport. Please note that the rootports also transmit
(internally) the ERR_* messages for errors seen by the internal rootport PCI
device, so these counters include them and are thus cumulative of all the error
messages on the PCI hierarchy originating at that root port.
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_COR messages reported to rootport.
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_FATAL messages reported to rootport.
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_NONFATAL messages reported to rootport.
PCIe AER ratelimits
-------------------
These attributes show up under all the devices that are AER capable.
They represent configurable ratelimits of logs per error type.
See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_interval_ms
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Writing 0 disables AER correctable error log ratelimiting.
Writing a positive value sets the ratelimit interval in ms.
Default is DEFAULT_RATELIMIT_INTERVAL (5000 ms).
What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_burst
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Ratelimit burst for correctable error logs. Writing a value
changes the number of errors (burst) allowed per interval
before ratelimiting. Reading gets the current ratelimit
burst. Default is DEFAULT_RATELIMIT_BURST (10).
What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_interval_ms
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Writing 0 disables AER non-fatal uncorrectable error log
ratelimiting. Writing a positive value sets the ratelimit
interval in ms. Default is DEFAULT_RATELIMIT_INTERVAL
(5000 ms).
What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_burst
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Ratelimit burst for non-fatal uncorrectable error logs.
Writing a value changes the number of errors (burst)
allowed per interval before ratelimiting. Reading gets the
current ratelimit burst. Default is DEFAULT_RATELIMIT_BURST
(10).