Update memory repair control interface for memory sparing feature.

CXL memory devices can support soft and hard memory sparing at cacheline, row, bank and rank granularities. Memory sparing is defined as a repair function that replaces a portion of memory with a portion of functional memory at that same granularity.

When a CXL device detects an error in memory, it will report to the host that there's need for a repair maintenance operation by using an event record where the "maintenance needed" flag is set. The event records contain the device physical address (DPA) and other attributes of the memory to repair such as bank group, bank, rank, row, column, channel etc. The kernel will report the corresponding CXL general media or DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair operation in response to the device request via the sysfs repair control.

[ bp: Massage. ]

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-15-shiju.jose@huawei.com

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
		pertains to the control of memory media repair features, such
		as PPR (Post Package Repair) and memory sparing, where the
		<dev-name> directory corresponds to a device registered with
		the EDAC device driver for the memory repair features.

		Post Package Repair is a maintenance operation that requests
		the memory device to perform a repair operation on its media.
		It is a memory self-healing feature that fixes a failing memory
		location by replacing it with a spare row in a DRAM device. For
		example, a CXL memory device with DRAM components that support
		PPR features may implement PPR maintenance operations. DRAM
		components may support two types of PPR functions: hard PPR,
		for a permanent row repair, and soft PPR, for a temporary row
		repair. Soft PPR may be much faster than hard PPR, but the
		repair is lost with a power cycle.

		The sysfs attribute nodes for a repair feature are only
		present if the parent driver has implemented the corresponding
		attr callback function and provided the necessary operations
		to the EDAC device driver during registration.

		In some states of system configuration (e.g. before address
		decoders have been configured), memory devices (e.g. CXL)
		may not have an active mapping in the main host physical
		address map. As such, the memory to repair must be identified
		by a device-specific physical addressing scheme using a device
		physical address (DPA). The DPA and other control attributes
		to use will be presented in related error records.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_type
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RO) Memory repair type, for example post package repair or
		memory sparing. Valid values are:

		- ppr - Post package repair.

		- cacheline-sparing

		- row-sparing

		- bank-sparing

		- rank-sparing

		- All other values are reserved.

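The lookup described above can be sketched from userspace as follows. This is a minimal illustration, not part of the ABI: the helper name read_repair_type and the idea of passing the mem_repairX directory as an argument are assumptions made for the example, and it assumes the attribute reads back as one of the strings listed above followed by a newline.

```python
from pathlib import Path

# Values listed in the repair_type description; anything else is reserved.
KNOWN_REPAIR_TYPES = {
    "ppr",
    "cacheline-sparing",
    "row-sparing",
    "bank-sparing",
    "rank-sparing",
}


def read_repair_type(instance_dir):
    """Return the repair type of one mem_repairX directory.

    instance_dir is a hypothetical path such as
    /sys/bus/edac/devices/<dev-name>/mem_repair0.
    """
    value = (Path(instance_dir) / "repair_type").read_text().strip()
    if value not in KNOWN_REPAIR_TYPES:
        # Reserved values are rejected rather than guessed at.
        raise ValueError(f"reserved or unknown repair type: {value!r}")
    return value
```

A tool could call this once per mem_repairX instance to pick the instance that implements the repair function it needs.
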
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) Get/Set the current persist repair mode for a repair
		function. Depending on the memory repair function, the
		persist repair modes supported by the device are temporary,
		where the repair is lost with a power cycle, or permanent.
		Valid values are:

		- 0 - Soft memory repair (temporary repair).

		- 1 - Hard memory repair (permanent repair).

		- All other values are reserved.

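As a rough sketch of selecting a persist mode, the following helper writes one of the two defined values and reads the attribute back. The names set_persist_mode, PERSIST_SOFT and PERSIST_HARD are invented for this example, and the assumption that the attribute reads back as a plain integer is not stated in this document.

```python
from pathlib import Path

PERSIST_SOFT = 0  # temporary repair, lost with a power cycle
PERSIST_HARD = 1  # permanent repair


def set_persist_mode(instance_dir, mode):
    """Write persist_mode and return the value read back afterwards."""
    if mode not in (PERSIST_SOFT, PERSIST_HARD):
        # All other values are reserved.
        raise ValueError("reserved persist_mode value")
    node = Path(instance_dir) / "persist_mode"
    node.write_text(f"{mode}\n")
    # Assumption: the attribute reads back as an integer (base auto-detected).
    return int(node.read_text().strip(), 0)
```

Reading the value back is a cheap way to confirm that the device accepted the requested mode before a repair is issued.
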
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RO) True if memory media is accessible and data is retained
		during the memory repair operation.
		Otherwise, the data may not be retained and memory requests
		may not be correctly processed during a repair operation. In
		such a case the repair operation cannot be executed at runtime
		and the memory must be taken offline.

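The rule above amounts to a simple decision, sketched here. The helper name repair_allowed and the memory_online parameter are hypothetical, and the sketch assumes the attribute reads back as 0 or 1; how a tool determines whether the memory is online is outside the scope of this example.

```python
from pathlib import Path


def repair_allowed(instance_dir, memory_online):
    """Return True if a repair may be issued in the current memory state."""
    text = (Path(instance_dir) / "repair_safe_when_in_use").read_text().strip()
    # Assumption: the boolean is exposed as the integer 0 or 1.
    safe_when_in_use = int(text, 0) != 0
    # A repair that is not safe while in use requires the memory to be offline.
    return safe_when_in_use or not memory_online
```
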
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) Host Physical Address (HPA) of the memory to repair.
		The HPA to use will be provided in related error records.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) Device Physical Address (DPA) of the memory to repair.
		The specific DPA to use will be provided in related error
		records.

		In some states of system configuration (e.g. before address
		decoders have been configured), memory devices (e.g. CXL)
		may not have an active mapping in the main host physical
		address map. As such, the memory to repair must be identified
		by a device-specific physical addressing scheme using a DPA.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) Nibble mask of the memory to repair.
		The nibble mask identifies one or more nibbles in error on the
		memory bus that produced the error event. Nibble mask bit 0
		shall be set if nibble 0 on the memory bus produced the
		event, etc. For example, for CXL PPR and sparing, a nibble
		mask bit set to 1 indicates the request to perform the repair
		operation in the specific device. All nibble mask bits set
		to 1 indicates the request to perform the operation in all
		devices. For CXL memory repair, for example, the specific
		value of the nibble mask to use will be provided in related
		error records. For more details, see the nibble mask field in
		CXL spec ver 3.1, section 8.2.9.7.1.2 Table 8-103 soft PPR,
		section 8.2.9.7.1.3 Table 8-104 hard PPR and section
		8.2.9.7.1.4 Table 8-105 memory sparing.

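The bit numbering described above (bit N set when nibble N produced the event) can be illustrated with a small helper. The function name nibble_mask_from_indices is made up for this sketch. In practice the mask normally comes from the error record, so building it by hand is shown only to make the encoding concrete.

```python
def nibble_mask_from_indices(nibbles):
    """Build a nibble mask with bit N set for every nibble index N given."""
    mask = 0
    for index in nibbles:
        if index < 0:
            raise ValueError("nibble index must be non-negative")
        mask |= 1 << index
    return mask


# Nibbles 0 and 3 in error give bits 0 and 3, i.e. 0b1001.
print(hex(nibble_mask_from_indices([0, 3])))  # prints 0x9
```
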
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) The supported range of memory addresses to be repaired.
		The memory device may give the supported range of attributes
		to use, and the range will depend on the memory device and
		the portion of memory to repair.
		Userspace may receive the specific value of attributes
		to use for a repair operation from the memory device via
		related error records and trace events, for example CXL DRAM
		and CXL general media error records in CXL memory devices.

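A tool might check an address from an error record against the advertised range before writing it. The sketch below does this for the DPA. The helper names are hypothetical, and parsing with base auto-detection is an assumption chosen so that both decimal and 0x-prefixed readings are accepted.

```python
from pathlib import Path


def read_int_attr(path):
    """Parse a sysfs attribute as an integer, accepting decimal or 0x-prefixed hex."""
    return int(Path(path).read_text().strip(), 0)


def dpa_in_supported_range(instance_dir, dpa):
    """Return True if dpa lies within [min_dpa, max_dpa] for this instance."""
    base = Path(instance_dir)
    return read_int_attr(base / "min_dpa") <= dpa <= read_int_attr(base / "max_dpa")
```

The same comparison applies to min_hpa and max_hpa when the repair is addressed by HPA.
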
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) The control attributes for the memory to be repaired.
		The specific value of attributes to use depends on the
		portion of memory to repair and will be reported to the host
		in related error records and be available to userspace
		in trace events, such as CXL DRAM and CXL general media
		error records of CXL memory devices.

		When read back, these attributes return the current value of
		the memory requested to be repaired.

		bank_group - The bank group of the memory to repair.

		bank - The bank number of the memory to repair.

		rank - The rank of the memory to repair. Rank is defined as a
		set of memory devices on a channel that together execute a
		transaction.

		row - The row number of the memory to repair.

		column - The column number of the memory to repair.

		channel - The channel of the memory to repair. Channel is
		defined as an interface that can be independently accessed
		for a transaction.

		sub_channel - The subchannel of the memory to repair.

		The requirement to set these attributes varies based on the
		repair function. The attributes in sysfs are not present
		unless required for a repair function.

		For example, for the CXL spec ver 3.1, Section 8.2.9.7.1.2
		Table 8-103 soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard
		PPR operations, these attributes need not be set. For the CXL
		spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105 memory sparing,
		these attributes must be set based on the memory sparing
		granularity.

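Because these attributes exist only when the repair function needs them, a userspace tool can write just the nodes that are present. The following is a sketch under that reading. The helper name write_present_attrs and the choice to skip absent nodes silently are assumptions of the example, not behaviour defined by this document.

```python
from pathlib import Path

# Attribute names taken from the What: lines of this entry.
GEOMETRY_ATTRS = (
    "bank_group", "bank", "rank", "row", "column", "channel", "sub_channel",
)


def write_present_attrs(instance_dir, values):
    """Write each supplied value whose sysfs node exists; return the names written."""
    base = Path(instance_dir)
    written = []
    for name in GEOMETRY_ATTRS:
        node = base / name
        if name in values and node.exists():
            node.write_text(f"{values[name]}\n")
            written.append(name)
    return written
```

Returning the list of written names lets the caller log which fields of the error record were actually consumed by the repair function.
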
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(WO) Issue the memory repair operation for the specified
		memory repair attributes. The operation may fail if resources
		are insufficient based on the requirements of the memory
		device and repair function.

		- 1 - Issue the repair operation.

		- All other values are reserved.
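Putting the attributes together, a repair request addressed by DPA could be sketched as below. The function name issue_repair is hypothetical, the set of attributes written is only one plausible combination (a given repair function may expose more or fewer nodes), and writing addresses in 0x-prefixed hex is an assumption about what the kernel accepts.

```python
from pathlib import Path


def issue_repair(instance_dir, dpa, nibble_mask, hard=False):
    """Program the repair attributes, then trigger the repair.

    On a real sysfs tree a rejected write would normally surface as OSError.
    """
    base = Path(instance_dir)
    # 0 selects a soft (temporary) repair, 1 a hard (permanent) one.
    (base / "persist_mode").write_text("1\n" if hard else "0\n")
    (base / "dpa").write_text(f"{dpa:#x}\n")
    (base / "nibble_mask").write_text(f"{nibble_mask:#x}\n")
    # Writing 1 is the only defined value for the repair node.
    (base / "repair").write_text("1\n")
```

The repair node is written last so that the device sees a complete set of attributes when the operation is issued. The address and mask passed in would come from the related error record.
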