EDAC: Add a memory repair control feature
Add a generic EDAC memory repair control driver to manage memory repairs in
the system, such as CXL Post Package Repair (PPR) and other soft and hard PPR
features.
For example, a CXL device with DRAM components that support PPR features may
implement PPR maintenance operations. DRAM components may support two types of
PPR:
- hard PPR, for a permanent row repair, and
- soft PPR, for a temporary row repair.
Soft PPR is much faster than hard PPR, but the repair is lost with a power
cycle.
When a CXL device detects an error in a memory, it may report the need for
a repair maintenance operation by using an event record where the "maintenance
needed" flag is set. The event records contain the device physical
address (DPA) and other optional attributes of the memory to repair.
The kernel will report the corresponding CXL general media or DRAM trace event
to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
operation in response to the device request via the sysfs repair control.
A device with memory repair features registers with the EDAC device driver,
which retrieves a memory repair descriptor from the EDAC memory repair driver
and exposes the sysfs repair control attributes to userspace in
/sys/bus/edac/devices/<dev-name>/mem_repairX/.
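As a rough userspace illustration of this flow (a sketch only, not rasdaemon's
actual code), a tool could write the DPA and then trigger the repair through
the attribute files in that directory. The helper names, the directory
argument, and the value "1" written to the repair attribute are assumptions
made for the example; the attribute names dpa and repair are those exposed by
this driver.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write a text value to one attribute file under a mem_repair directory.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.)
 */
static int mem_repair_write_attr(const char *dir, const char *attr,
				 const char *value)
{
	char path[512];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}

	/* A failed write may only surface when the stream is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}

/* Set the DPA to repair, then request the repair. (Hypothetical helper.) */
static int mem_repair_request(const char *dir, unsigned long long dpa)
{
	char buf[32];

	snprintf(buf, sizeof(buf), "0x%llx", dpa);
	if (mem_repair_write_attr(dir, "dpa", buf))
		return -1;

	/* Assumption: "1" is the value the client driver accepts as "start". */
	return mem_repair_write_attr(dir, "repair", "1");
}
```

A caller would pass the mem_repairX directory of the device in question as
dir. Whether other attributes such as bank or row must be set first depends on
the client driver.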
The common memory repair control interface abstracts the control of arbitrary
memory repair functionality into a standardized set of functions. The sysfs
memory repair attribute nodes are only available if the client driver has
implemented the corresponding attribute callback function and provided
operations to the EDAC device driver during registration.
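The visibility rule for a read-write attribute can be sketched in plain C as
follows. This is a simplified userspace model, not the kernel code: struct
attr_callbacks, rw_attr_mode() and the demo callbacks are hypothetical names
that stand in for one getter/setter pair of the client driver's operations.

```c
#include <stddef.h>

/* Hypothetical stand-in for one attribute's pair of client callbacks. */
struct attr_callbacks {
	int (*get)(void *drv_data, unsigned long long *val);
	int (*set)(void *drv_data, unsigned long long val);
};

/*
 * Return the sysfs mode for a read-write attribute:
 *   0             - hidden, the client driver provides no getter
 *   0444          - read-only, getter present but no setter
 *   declared_mode - both callbacks present
 */
static unsigned int rw_attr_mode(const struct attr_callbacks *cb,
				 unsigned int declared_mode)
{
	if (!cb->get)
		return 0;

	return cb->set ? declared_mode : 0444;
}

/* Demo callbacks so the rule can be exercised. */
static int demo_get(void *drv_data, unsigned long long *val)
{
	(void)drv_data;
	*val = 0;
	return 0;
}

static int demo_set(void *drv_data, unsigned long long val)
{
	(void)drv_data;
	(void)val;
	return 0;
}
```

Read-only and write-only attributes follow the simpler form of the same rule:
the node is shown with its declared mode if its single callback exists and is
hidden otherwise.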
[ bp: Massage, fixup edac_dev_register() retvals, merge
write_overflow fix to mem_repair_create_desc() ]
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-5-shiju.jose@huawei.com
2025-02-12 14:36:42 +00:00

// SPDX-License-Identifier: GPL-2.0
/*
 * The generic EDAC memory repair driver is designed to control the memory
 * devices with memory repair features, such as Post Package Repair (PPR),
 * memory sparing etc. The common sysfs memory repair interface abstracts
 * the control of various arbitrary memory repair functionalities into a
 * unified set of functions.
 *
 * Copyright (c) 2024-2025 HiSilicon Limited.
 */

#include <linux/edac.h>

enum edac_mem_repair_attributes {
	MR_TYPE,
	MR_PERSIST_MODE,
	MR_SAFE_IN_USE,
	MR_HPA,
	MR_MIN_HPA,
	MR_MAX_HPA,
	MR_DPA,
	MR_MIN_DPA,
	MR_MAX_DPA,
	MR_NIBBLE_MASK,
	MR_BANK_GROUP,
	MR_BANK,
	MR_RANK,
	MR_ROW,
	MR_COLUMN,
	MR_CHANNEL,
	MR_SUB_CHANNEL,
	MEM_DO_REPAIR,
	MR_MAX_ATTRS
};

struct edac_mem_repair_dev_attr {
	struct device_attribute dev_attr;
	u8 instance;
};

struct edac_mem_repair_context {
	char name[EDAC_FEAT_NAME_LEN];
	struct edac_mem_repair_dev_attr mem_repair_dev_attr[MR_MAX_ATTRS];
	struct attribute *mem_repair_attrs[MR_MAX_ATTRS + 1];
	struct attribute_group group;
};

cxl/edac: Add CXL memory device memory sparing control feature
Memory sparing is defined as a repair function that replaces a portion of
memory with a portion of functional memory at that same DPA. The subclasses
for this operation vary in terms of the scope of the sparing being
performed. The cacheline sparing subclass refers to a sparing action that
can replace a full cacheline. Row sparing is provided as an alternative to
PPR sparing functions and its scope is that of a single DDR row.
As per CXL r3.2 Table 8-125, footnote 1, memory sparing is preferred over
PPR when possible.
Bank sparing allows an entire bank to be replaced. Rank sparing is defined
as an operation in which an entire DDR rank is replaced.
Memory sparing maintenance operations may be supported by CXL devices
that implement CXL.mem protocol. A sparing maintenance operation requests
the CXL device to perform a repair operation on its media.
For example, a CXL device with DRAM components that support memory sparing
features may implement sparing maintenance operations.
The host may issue a query command by setting the query resources flag in the
input payload (CXL spec 3.2 Table 8-120) to determine availability of
sparing resources for a given address. In response to a query request,
the device shall report the resource availability by producing the memory
sparing event record (CXL spec 3.2 Table 8-60) in which the Channel, Rank,
Nibble Mask, Bank Group, Bank, Row, Column, Sub-Channel fields are a copy
of the values specified in the request.
During the execution of a sparing maintenance operation, a CXL memory
device:
- may not retain data
- may not be able to process CXL.mem requests correctly.
These CXL memory device capabilities are specified by restriction flags
in the memory sparing feature readable attributes.
When a CXL device identifies an error on a memory component, the device
may inform the host about the need for a memory sparing maintenance
operation by using a DRAM event record in which the 'maintenance needed'
flag may be set. The event record contains some of the DPA, Channel, Rank,
Nibble Mask, Bank Group, Bank, Row, Column, Sub-Channel fields that
should be repaired. The userspace tool requests a maintenance operation
if the 'maintenance needed' flag is set in the CXL DRAM error record.
CXL spec 3.2 section 8.2.10.7.1.4 describes the device's memory sparing
maintenance operation feature.
CXL spec 3.2 section 8.2.10.7.2.3 describes the memory sparing feature
discovery and configuration.
Add support for controlling the CXL memory device memory sparing feature.
Register with the EDAC driver, which gets the memory repair attr descriptors
from the EDAC memory repair driver and exposes sysfs repair control
attributes for memory sparing to userspace. For example, CXL memory
sparing control for the CXL mem0 device is exposed in
/sys/bus/edac/devices/cxl_mem0/mem_repairX/
Use case
========
1. The CXL device identifies a failure in a memory component and reports it
to userspace in a CXL DRAM trace event with the DPA and other attributes of
the memory to repair, such as channel, rank, nibble mask, bank group,
bank, row, column, sub-channel.
2. Rasdaemon processes the trace event and may issue a query request in sysfs
to check the resources available for memory sparing if any of the following
conditions is met:
- The 'maintenance needed' flag is set in the event record.
- The 'threshold event' flag is set for the CVME threshold feature.
- The number of corrected errors reported on a CXL.mem media to
userspace exceeds the threshold value for the corrected error count defined
by the userspace policy.
3. Rasdaemon processes the memory sparing trace event and issues a repair
request for memory sparing.
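The conditions in step 2 amount to a small userspace policy check, sketched
here in C. Everything in this block is illustrative: the flag bit positions,
struct dram_event and sparing_wanted() are hypothetical and do not reproduce
the CXL event record layout or rasdaemon's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real layout is defined by the CXL spec. */
#define EV_FLAG_MAINT_NEEDED	(1u << 0)
#define EV_FLAG_THRESHOLD	(1u << 1)

/* Hypothetical, trimmed-down view of a DRAM event seen by userspace. */
struct dram_event {
	uint32_t flags;
	uint64_t dpa;
};

/*
 * Decide whether userspace should consider memory sparing for this event:
 * either flag is set, or the corrected error count seen so far exceeds the
 * threshold chosen by the userspace policy.
 */
static bool sparing_wanted(const struct dram_event *ev,
			   unsigned int ce_count, unsigned int ce_threshold)
{
	if (ev->flags & (EV_FLAG_MAINT_NEEDED | EV_FLAG_THRESHOLD))
		return true;

	return ce_count > ce_threshold;
}
```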
The kernel CXL driver shall report the memory sparing event record to userspace
with the resource availability so that rasdaemon can process the event
record and issue a repair request in sysfs for the memory sparing operation
in the CXL device.
Note: Based on feedback from the community, the 'query' sysfs attribute was
removed and reporting the memory sparing error record to userspace is not
supported. Instead, userspace issues the sparing operation and the kernel
issues it to the CXL memory device when the 'maintenance needed' flag is set
in the DRAM event record.
Add checks to ensure the memory to be repaired is offline and, if online,
that it originates from a CXL DRAM error record reported in the current boot
before requesting a memory sparing operation on the device.
Note: Tested memory sparing feature control with QEMU patch
"hw/cxl: Add emulation for memory sparing control feature"
https://lore.kernel.org/linux-cxl/20250509172229.726-1-shiju.jose@huawei.com/T/#m5f38512a95670d75739f9dad3ee91b95c7f5c8d6
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250521124749.817-8-shiju.jose@huawei.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-21 13:47:45 +01:00

const char * const edac_repair_type[] = {
	[EDAC_REPAIR_PPR] = "ppr",
	[EDAC_REPAIR_CACHELINE_SPARING] = "cacheline-sparing",
	[EDAC_REPAIR_ROW_SPARING] = "row-sparing",
	[EDAC_REPAIR_BANK_SPARING] = "bank-sparing",
	[EDAC_REPAIR_RANK_SPARING] = "rank-sparing",
};
EXPORT_SYMBOL_GPL(edac_repair_type);

#define TO_MR_DEV_ATTR(_dev_attr)	\
	container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr)

#define MR_ATTR_SHOW(attrib, cb, type, format)				\
static ssize_t attrib##_show(struct device *ras_feat_dev,		\
			     struct device_attribute *attr, char *buf)	\
{									\
	u8 inst = TO_MR_DEV_ATTR(attr)->instance;			\
	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);	\
	const struct edac_mem_repair_ops *ops =				\
		ctx->mem_repair[inst].mem_repair_ops;			\
	type data;							\
	int ret;							\
									\
	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \
		      &data);						\
	if (ret)							\
		return ret;						\
									\
	return sysfs_emit(buf, format, data);				\
}

MR_ATTR_SHOW(repair_type, get_repair_type, const char *, "%s\n")
MR_ATTR_SHOW(persist_mode, get_persist_mode, bool, "%u\n")
MR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, bool, "%u\n")
MR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n")
MR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n")
MR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n")
MR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
MR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
MR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
MR_ATTR_SHOW(nibble_mask, get_nibble_mask, u32, "0x%x\n")
MR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
MR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
MR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
MR_ATTR_SHOW(row, get_row, u32, "0x%x\n")
MR_ATTR_SHOW(column, get_column, u32, "%u\n")
MR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
MR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")

#define MR_ATTR_STORE(attrib, cb, type, conv_func)			\
static ssize_t attrib##_store(struct device *ras_feat_dev,		\
			      struct device_attribute *attr,		\
			      const char *buf, size_t len)		\
{									\
	u8 inst = TO_MR_DEV_ATTR(attr)->instance;			\
	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);	\
	const struct edac_mem_repair_ops *ops =				\
		ctx->mem_repair[inst].mem_repair_ops;			\
	type data;							\
	int ret;							\
									\
	ret = conv_func(buf, 0, &data);					\
	if (ret < 0)							\
		return ret;						\
									\
	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \
		      data);						\
	if (ret)							\
		return ret;						\
									\
	return len;							\
}

MR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul)
MR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
MR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
MR_ATTR_STORE(nibble_mask, set_nibble_mask, unsigned long, kstrtoul)
MR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
MR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
MR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
MR_ATTR_STORE(row, set_row, unsigned long, kstrtoul)
MR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
MR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
MR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)

#define MR_DO_OP(attrib, cb)						\
static ssize_t attrib##_store(struct device *ras_feat_dev,		\
			      struct device_attribute *attr,		\
			      const char *buf, size_t len)		\
{									\
	u8 inst = TO_MR_DEV_ATTR(attr)->instance;			\
	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);	\
	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops; \
	unsigned long data;						\
	int ret;							\
									\
	ret = kstrtoul(buf, 0, &data);					\
	if (ret < 0)							\
		return ret;						\
									\
	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data); \
	if (ret)							\
		return ret;						\
									\
	return len;							\
}

MR_DO_OP(repair, do_repair)

static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
{
	struct device *ras_feat_dev = kobj_to_dev(kobj);
	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
	u8 inst = TO_MR_DEV_ATTR(dev_attr)->instance;
	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;

	switch (attr_id) {
	case MR_TYPE:
		if (ops->get_repair_type)
			return a->mode;
		break;
	case MR_PERSIST_MODE:
		if (ops->get_persist_mode) {
			if (ops->set_persist_mode)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_SAFE_IN_USE:
		if (ops->get_repair_safe_when_in_use)
			return a->mode;
		break;
	case MR_HPA:
		if (ops->get_hpa) {
			if (ops->set_hpa)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_MIN_HPA:
		if (ops->get_min_hpa)
			return a->mode;
		break;
	case MR_MAX_HPA:
		if (ops->get_max_hpa)
			return a->mode;
		break;
	case MR_DPA:
		if (ops->get_dpa) {
			if (ops->set_dpa)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_MIN_DPA:
		if (ops->get_min_dpa)
			return a->mode;
		break;
	case MR_MAX_DPA:
		if (ops->get_max_dpa)
			return a->mode;
		break;
	case MR_NIBBLE_MASK:
		if (ops->get_nibble_mask) {
			if (ops->set_nibble_mask)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_BANK_GROUP:
		if (ops->get_bank_group) {
			if (ops->set_bank_group)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_BANK:
		if (ops->get_bank) {
			if (ops->set_bank)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_RANK:
		if (ops->get_rank) {
			if (ops->set_rank)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_ROW:
		if (ops->get_row) {
			if (ops->set_row)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_COLUMN:
		if (ops->get_column) {
			if (ops->set_column)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_CHANNEL:
		if (ops->get_channel) {
			if (ops->set_channel)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MR_SUB_CHANNEL:
		if (ops->get_sub_channel) {
			if (ops->set_sub_channel)
				return a->mode;
			else
				return 0444;
		}
		break;
	case MEM_DO_REPAIR:
		if (ops->do_repair)
			return a->mode;
		break;
	default:
		break;
	}

	return 0;
}

static const struct device_attribute mem_repair_dev_attr[] = {
	[MR_TYPE]	  = __ATTR_RO(repair_type),
	[MR_PERSIST_MODE] = __ATTR_RW(persist_mode),
	[MR_SAFE_IN_USE]  = __ATTR_RO(repair_safe_when_in_use),
	[MR_HPA]	  = __ATTR_RW(hpa),
	[MR_MIN_HPA]	  = __ATTR_RO(min_hpa),
	[MR_MAX_HPA]	  = __ATTR_RO(max_hpa),
	[MR_DPA]	  = __ATTR_RW(dpa),
	[MR_MIN_DPA]	  = __ATTR_RO(min_dpa),
	[MR_MAX_DPA]	  = __ATTR_RO(max_dpa),
	[MR_NIBBLE_MASK]  = __ATTR_RW(nibble_mask),
	[MR_BANK_GROUP]	  = __ATTR_RW(bank_group),
	[MR_BANK]	  = __ATTR_RW(bank),
	[MR_RANK]	  = __ATTR_RW(rank),
	[MR_ROW]	  = __ATTR_RW(row),
	[MR_COLUMN]	  = __ATTR_RW(column),
	[MR_CHANNEL]	  = __ATTR_RW(channel),
	[MR_SUB_CHANNEL]  = __ATTR_RW(sub_channel),
	[MEM_DO_REPAIR]	  = __ATTR_WO(repair)
};

static int mem_repair_create_desc(struct device *dev,
				  const struct attribute_group **attr_groups,
				  u8 instance)
{
	struct edac_mem_repair_context *ctx;
	struct attribute_group *group;
	int i;

	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
	if (!ctx)
		return -ENOMEM;

	for (i = 0; i < MR_MAX_ATTRS; i++) {
		ctx->mem_repair_dev_attr[i].dev_attr = mem_repair_dev_attr[i];
		ctx->mem_repair_dev_attr[i].instance = instance;
		sysfs_attr_init(&ctx->mem_repair_dev_attr[i].dev_attr.attr);
		ctx->mem_repair_attrs[i] =
			&ctx->mem_repair_dev_attr[i].dev_attr.attr;
	}

	sprintf(ctx->name, "%s%d", "mem_repair", instance);
	group = &ctx->group;
	group->name = ctx->name;
	group->attrs = ctx->mem_repair_attrs;
	group->is_visible = mem_repair_attr_visible;
	attr_groups[0] = group;

	return 0;
}

/**
 * edac_mem_repair_get_desc - get EDAC memory repair descriptors
 * @dev: client device with memory repair feature
 * @attr_groups: pointer to attribute group container
 * @instance: device's memory repair instance number.
 *
 * Return:
 *  * %0	- Success.
 *  * %-EINVAL	- Invalid parameters passed.
 *  * %-ENOMEM	- Dynamic memory allocation failed.
 */
int edac_mem_repair_get_desc(struct device *dev,
			     const struct attribute_group **attr_groups, u8 instance)
{
	if (!dev || !attr_groups)
		return -EINVAL;

	return mem_repair_create_desc(dev, attr_groups, instance);
}