mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-08-05 16:54:27 +00:00

Allow userspace to read/write log ratelimits per device (including enable/disable). Create aer/ sysfs directory to store them and any future AER configs. The new sysfs files are: /sys/bus/pci/devices/*/aer/correctable_ratelimit_burst /sys/bus/pci/devices/*/aer/correctable_ratelimit_interval_ms /sys/bus/pci/devices/*/aer/nonfatal_ratelimit_burst /sys/bus/pci/devices/*/aer/nonfatal_ratelimit_interval_ms The default values are ratelimit_burst=10, ratelimit_interval_ms=5000, so if we try to emit more than 10 messages in a 5 second period, some are suppressed. Update AER sysfs ABI filename to reflect the broader scope of AER sysfs attributes (e.g. stats and ratelimits). Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats -> sysfs-bus-pci-devices-aer Tested using aer-inject[1]. Configured correctable log ratelimit to 5. Sent 6 AER errors. Observed 5 errors logged while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6. Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors logged and accounted in AER stats (12 total errors). [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git [bhelgaas: note fatal errors are not ratelimited, "aer_report" -> "aer_info", replace ratelimit_log_enable toggle with *_ratelimit_interval_ms] Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com> Signed-off-by: Jon Pan-Doh <pandoh@google.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-21-helgaas@kernel.org
163 lines
6 KiB
Text
163 lines
6 KiB
Text
PCIe Device AER statistics
|
|
--------------------------
|
|
|
|
These attributes show up under all the devices that are AER capable. These
|
|
statistical counters indicate the errors "as seen/reported by the device".
|
|
Note that this may mean that if an endpoint is causing problems, the AER
|
|
counters may increment at its link partner (e.g. root port) because the
|
|
errors may be "seen" / reported by the link partner and not the
|
|
problematic endpoint itself (which may report all counters as 0 as it never
|
|
saw any problems).
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
|
|
Date: July 2018
|
|
KernelVersion: 4.19.0
|
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
|
Description: List of correctable errors seen and reported by this
|
|
PCI device using ERR_COR. Note that since multiple errors may
|
|
be reported using a single ERR_COR message, thus
|
|
TOTAL_ERR_COR at the end of the file may not match the actual
|
|
total of all the errors in the file. Sample output::
|
|
|
|
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
|
|
Receiver Error 2
|
|
Bad TLP 0
|
|
Bad DLLP 0
|
|
RELAY_NUM Rollover 0
|
|
Replay Timer Timeout 0
|
|
Advisory Non-Fatal 0
|
|
Corrected Internal Error 0
|
|
Header Log Overflow 0
|
|
TOTAL_ERR_COR 2
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
|
|
Date: July 2018
|
|
KernelVersion: 4.19.0
|
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
|
Description: List of uncorrectable fatal errors seen and reported by this
|
|
PCI device using ERR_FATAL. Note that since multiple errors may
|
|
be reported using a single ERR_FATAL message, thus
|
|
TOTAL_ERR_FATAL at the end of the file may not match the actual
|
|
total of all the errors in the file. Sample output::
|
|
|
|
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
|
|
Undefined 0
|
|
Data Link Protocol 0
|
|
Surprise Down Error 0
|
|
Poisoned TLP 0
|
|
Flow Control Protocol 0
|
|
Completion Timeout 0
|
|
Completer Abort 0
|
|
Unexpected Completion 0
|
|
Receiver Overflow 0
|
|
Malformed TLP 0
|
|
ECRC 0
|
|
Unsupported Request 0
|
|
ACS Violation 0
|
|
Uncorrectable Internal Error 0
|
|
MC Blocked TLP 0
|
|
AtomicOp Egress Blocked 0
|
|
TLP Prefix Blocked Error 0
|
|
TOTAL_ERR_FATAL 0
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
|
|
Date: July 2018
|
|
KernelVersion: 4.19.0
|
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
|
Description: List of uncorrectable nonfatal errors seen and reported by this
|
|
PCI device using ERR_NONFATAL. Note that since multiple errors
|
|
may be reported using a single ERR_FATAL message, thus
|
|
TOTAL_ERR_NONFATAL at the end of the file may not match the
|
|
actual total of all the errors in the file. Sample output::
|
|
|
|
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
|
|
Undefined 0
|
|
Data Link Protocol 0
|
|
Surprise Down Error 0
|
|
Poisoned TLP 0
|
|
Flow Control Protocol 0
|
|
Completion Timeout 0
|
|
Completer Abort 0
|
|
Unexpected Completion 0
|
|
Receiver Overflow 0
|
|
Malformed TLP 0
|
|
ECRC 0
|
|
Unsupported Request 0
|
|
ACS Violation 0
|
|
Uncorrectable Internal Error 0
|
|
MC Blocked TLP 0
|
|
AtomicOp Egress Blocked 0
|
|
TLP Prefix Blocked Error 0
|
|
TOTAL_ERR_NONFATAL 0
|
|
|
|
PCIe Rootport AER statistics
|
|
----------------------------
|
|
|
|
These attributes show up under only the rootports (or root complex event
|
|
collectors) that are AER capable. These indicate the number of error messages as
|
|
"reported to" the rootport. Please note that the rootports also transmit
|
|
(internally) the ERR_* messages for errors seen by the internal rootport PCI
|
|
device, so these counters include them and are thus cumulative of all the error
|
|
messages on the PCI hierarchy originating at that root port.
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
|
|
Date: July 2018
|
|
KernelVersion: 4.19.0
|
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
|
Description: Total number of ERR_COR messages reported to rootport.
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
|
|
Date: July 2018
|
|
KernelVersion: 4.19.0
|
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
|
Description: Total number of ERR_FATAL messages reported to rootport.
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
|
|
Date: July 2018
|
|
KernelVersion: 4.19.0
|
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
|
Description: Total number of ERR_NONFATAL messages reported to rootport.
|
|
|
|
PCIe AER ratelimits
|
|
-------------------
|
|
|
|
These attributes show up under all the devices that are AER capable.
|
|
They represent configurable ratelimits of logs per error type.
|
|
|
|
See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_interval_ms
|
|
Date: May 2025
|
|
KernelVersion: 6.16.0
|
|
Contact: linux-pci@vger.kernel.org
|
|
Description: Writing 0 disables AER correctable error log ratelimiting.
|
|
Writing a positive value sets the ratelimit interval in ms.
|
|
Default is DEFAULT_RATELIMIT_INTERVAL (5000 ms).
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_burst
|
|
Date: May 2025
|
|
KernelVersion: 6.16.0
|
|
Contact: linux-pci@vger.kernel.org
|
|
Description: Ratelimit burst for correctable error logs. Writing a value
|
|
changes the number of errors (burst) allowed per interval
|
|
before ratelimiting. Reading gets the current ratelimit
|
|
burst. Default is DEFAULT_RATELIMIT_BURST (10).
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_interval_ms
|
|
Date: May 2025
|
|
KernelVersion: 6.16.0
|
|
Contact: linux-pci@vger.kernel.org
|
|
Description: Writing 0 disables AER non-fatal uncorrectable error log
|
|
ratelimiting. Writing a positive value sets the ratelimit
|
|
interval in ms. Default is DEFAULT_RATELIMIT_INTERVAL
|
|
(5000 ms).
|
|
|
|
What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_burst
|
|
Date: May 2025
|
|
KernelVersion: 6.16.0
|
|
Contact: linux-pci@vger.kernel.org
|
|
Description: Ratelimit burst for non-fatal uncorrectable error logs.
|
|
Writing a value changes the number of errors (burst)
|
|
allowed per interval before ratelimiting. Reading gets the
|
|
current ratelimit burst. Default is DEFAULT_RATELIMIT_BURST
|
|
(10).
|