linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-08-05 16:54:27 +00:00

Author	SHA1	Message	Date
Qiuxu Zhuo	5904dc561e	EDAC/{skx_common,i10nm}: Add RRL support for Intel Granite Rapids server Compared to previous generations, Granite Rapids defines the RRL control bits {en_patspr, noover, en} in different positions, adds an extra RRL set for the new mode of the first patrol-scrub read error, and extends the number of CORRERRCNT registers from 4 to 8, encoding one counter per CORRERRCNT register. Add a Granite Rapids reg_rrl configuration table and adjust the code to accommodate the differences mentioned above for RRL support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-8-qiuxu.zhuo@intel.com	2025-04-17 10:45:21 -07:00
Qiuxu Zhuo	126168fa2c	EDAC/{skx_common,i10nm}: Refactor show_retry_rd_err_log() Make the {valid bit, overwritten status, number} of RRL registers and the {number, offsets, widths} of per-channel CORRERRCNT registers configurable. Refactor show_retry_rd_err_log() to use the configurable fields of struct reg_rrl, making the code more scalable and simpler. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-7-qiuxu.zhuo@intel.com	2025-04-17 10:33:37 -07:00
Qiuxu Zhuo	ba3985c1fa	EDAC/{skx_common,i10nm}: Refactor enable_retry_rd_err_log() Refactor enable_retry_rd_err_log() using helper functions for both DDR and HBM, making the RRL control bits configurable instead of hard-coded. Additionally, explicitly define the four RRL modes for better readability. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-6-qiuxu.zhuo@intel.com	2025-04-17 10:31:09 -07:00
Qiuxu Zhuo	1a8a6af663	EDAC/{skx_common,i10nm}: Structure the per-channel RRL registers As the number of RRL (retry_rd_err_log) registers per memory channel increases, the positions of the RRL control bits and the widths of the RRL registers vary across different CPU generations. Adding RRL support for a new CPU requires handling these differences throughout the RRL-related code. Structure the offsets, widths, control bit positions, set numbers, modes, etc., of the per-channel RRL registers and make them configurable to facilitate easier RRL support for new CPUs. No functional changes are intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-5-qiuxu.zhuo@intel.com	2025-04-17 10:28:09 -07:00
Qiuxu Zhuo	eeed3e03f4	EDAC/{skx_common,i10nm}: Fix the loss of saved RRL for HBM pseudo channel 0 When enabling the retry_rd_err_log (RRL) feature during the loading of the i10nm_edac driver with the module parameter retry_rd_err_log=2 (Linux RRL control mode), the default values of the control bits of RRL are saved so that they can be restored during the unloading of the driver. In the current code, the RRL of pseudo channel 1 of HBM overwrites pseudo channel 0 during the loading of the driver, resulting in the loss of saved RRL for pseudo channel 0. This causes the RRL of pseudo channel 0 of HBM to be wrongly restored with the values from pseudo channel 1 when unloading the driver. Fix this issue by creating two separate groups of RRL control registers per channel to save default RRL settings of two {sub-,pseudo-}channels. Fixes: `acd4cf68fe` ("EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-3-qiuxu.zhuo@intel.com	2025-04-17 10:22:56 -07:00
Qiuxu Zhuo	d9207cf776	EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald Rapids When doing error injection to some memory DIMMs on certain Intel Emerald Rapids servers, the i10nm_edac missed error reports for some memory DIMMs. Certain BIOS configurations may hide some memory controllers, and the i10nm_edac doesn't enumerate these hidden memory controllers. However, the ADXL decodes memory errors using memory controller physical indices even if there are hidden memory controllers. Therefore, the memory controller physical indices reported by the ADXL may mismatch the logical indices enumerated by the i10nm_edac, resulting in missed error reports for some memory DIMMs. Fix this issue by creating a mapping table from memory controller physical indices (used by the ADXL) to logical indices (used by the i10nm_edac) and using it to convert the physical indices to the logical indices during the error handling process. Fixes: `c545f5e412` ("EDAC/i10nm: Skip the absent memory controllers") Reported-by: Kevin Chang <kevin1.chang@intel.com> Tested-by: Kevin Chang <kevin1.chang@intel.com> Reported-by: Thomas Chen <Thomas.Chen@intel.com> Tested-by: Thomas Chen <Thomas.Chen@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250214002728.6287-1-qiuxu.zhuo@intel.com	2025-02-20 17:02:33 -08:00
Kyle Meyer	584e09743d	EDAC/{i10nm,skx,skx_common}: Support UV systems The 3-bit source IDs in PCI configuration space registers, used to map devices to sockets, are limited to 8 unique IDs, and each ID is local to a UPI/QPI domain. Source IDs cannot be used to map devices to sockets on UV systems because they can exceed 8 sockets and have multiple UPI/QPI domains with identical, repeating source IDs. Use NUMA information to get package IDs instead of source IDs on UV systems, and use package/source IDs to name IMC information structures. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/all/20241213012549.43099-1-kyle.meyer@hpe.com/	2024-12-13 11:10:31 -08:00
Qiuxu Zhuo	a36667037a	EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator The Granite Rapids CPUs with Flat2LM memory configurations may mistakenly report near-memory errors as far-memory errors, resulting in the invalid decoded ADXL results: EDAC skx: Bad imc -1 Fix this incorrect far-memory error source indicator by prefetching the decoded far-memory controller ID, and adjust the error source indicator to near-memory if the far-memory controller ID is invalid. Fixes: `ba987eaaab` ("EDAC/i10nm: Add Intel Granite Rapids server support") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com> Link: https://lore.kernel.org/r/20241015072236.24543-3-qiuxu.zhuo@intel.com	2024-10-23 11:59:21 -07:00
Qiuxu Zhuo	2397f79573	EDAC/skx_common: Differentiate memory error sources The current skx_common determines whether the memory error source is the near memory of the 2LM system and then retrieves the decoded error results from the ADXL components (near-memory vs. far-memory) accordingly. However, some memory controllers may have limitations in correctly reporting the memory error source, leading to the retrieval of incorrect decoded parts from the ADXL. To address these limitations, instead of simply determining whether the memory error is from the near memory of the 2LM system, it is necessary to distinguish the memory error source details as follows: Memory error from the near memory of the 2LM system. Memory error from the far memory of the 2LM system. Memory error from the 1LM system. Not a memory error. This will enable the i10nm_edac driver to take appropriate actions for those memory controllers that have limitations in reporting the memory error source. Fixes: `ba987eaaab` ("EDAC/i10nm: Add Intel Granite Rapids server support") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com> Link: https://lore.kernel.org/r/20241015072236.24543-2-qiuxu.zhuo@intel.com	2024-10-23 11:58:43 -07:00
Linus Torvalds	7dfc15c473	- Drop a now obsolete ppc4xx_edac driver - Fix conversion to physical memory addresses on Intel's Elkhart Lake and Ice Lake hardware when the system address is above the (Top-Of-Memory) TOM address - Pay attention to the memory hole on Zynq UltraScale+ MPSoC DDR controllers when injecting errors for testing purposes - Add support for translating normalized error addresses reported by an AMD memory controller into system physical addresses using an UEFI mechanism called platform runtime mechanism (PRM). - The usual cleanups and fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmbeuNcACgkQEsHwGGHe VUoELw//fZaWbfYg7yYw8iTMojc01LCmS5m6nQeJc6PewcIfLp6FXr4V4Rq99NUn FBVIMunm0unRAqep9WTY+xphxlP9u9VovyaLR0cxRf1aEi3xRFit7PIG7P3RyTUn ipDKBnx0plTlwB9US5XllhGCM6xAvrNBoKPe1LV+bd7z9wOJvIy3GeV/65ajLsLV +7wNBJ8CMXIJ+319FK35ZUM1butp2XFLVtLqKL53nPsumowZcegfaD1u6sfsX4SO je8BpNMXKHl0ftZ3DPAMAGrr4M54lsXX/62k3PqcUr4LMbVGLzQmDGyoHUWwdruT OGb5tVWqBXoR6DA03/P25q1SGKwGsbuzK33E8T9vkwIqBrj73vA+tVBv03U3QFMO RSb4/BS09q/GtA70OFCnigumLoKMmuZu0tcLGQaUMP6sWVVVMp1vVctTapl22h57 sonEUf0+GMsVu4ueS/vSfU3R3Dqadg/4LxZPG7njc06hCNDAu7u4/0gGdGuiQwqF ZyLUZO3SlJX/SkWfNyW4Lc4GNWRWgtFfh5sgODxATCE5NyUrazsQZg5Jsxr/5Jwv aBDsbHEUHO0zKRGfDBfHyaWK8318z+my8zvVhIGLuQCKEY8GSTK35rfthkp6vbEe UNrCgea+HaDZt6jN4ahaZjK/0DjiMSO12gA3GPt7tdO6v+U46/0= =+/Fq -----END PGP SIGNATURE----- Merge tag 'edac_updates_for_v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Drop a now obsolete ppc4xx_edac driver - Fix conversion to physical memory addresses on Intel's Elkhart Lake and Ice Lake hardware when the system address is above the (Top-Of-Memory) TOM address - Pay attention to the memory hole on Zynq UltraScale+ MPSoC DDR controllers when injecting errors for testing purposes - Add support for translating normalized error addresses reported by an AMD memory controller into system physical addresses using an UEFI mechanism called platform runtime mechanism (PRM). - The usual cleanups and fixes * tag 'edac_updates_for_v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC: Drop obsolete PPC4xx driver EDAC/sb_edac: Fix the compile warning of large frame size EDAC/{skx_common,i10nm}: Remove the AMAP register for determing DDR5 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common EDAC/igen6: Fix conversion of system address to physical memory address EDAC/synopsys: Fix error injection on Zynq UltraScale+ RAS/AMD/ATL: Translate normalized to system physical addresses using PRM ACPI: PRM: Add PRM handler direct call support	2024-09-16 06:36:37 +02:00
Qiuxu Zhuo	8b93582353	EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common Commit afdb82fd763c ("EDAC, i10nm: make skx_common.o a separate module") made skx_common.o a separate module. With skx_common.o now a separate module, move the common debug code setup_{skx,i10nm}_debug() and teardown_{skx,i10nm}_debug() in {skx,i10nm}_base.c to skx_common.c to reduce code duplication. Additionally, prefix these function names with 'skx' to maintain consistency with other names in the file. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20240829055101.56245-1-qiuxu.zhuo@intel.com	2024-09-03 12:35:06 -07:00
Linus Torvalds	1a251f52cf	minmax: make generic MIN() and MAX() macros available everywhere This just standardizes the use of MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros. These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches: - trivial duplicated macro definitions have been removed Note that 'trivial' here means that it's obviously kernel code that already included all the major kernel headers, and thus gets the new generic MIN/MAX macros automatically. - non-trivial duplicated macro definitions are guarded with #ifndef This is the "yes, they define their own versions, but no, the include situation is not entirely obvious, and maybe they don't get the generic version automatically" case. - strange use case #1 A couple of drivers decided that the way they want to describe their versioning is with #define MAJ 1 #define MIN 2 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) which adds zero value and I just did my Alexander the Great impersonation, and rewrote that pointless Gordian knot as #define DRV_VERSION "1.2" instead. - strange use case #2 A couple of drivers thought that it's a good idea to have a random 'MIN' or 'MAX' define for a value or index into a table, rather than the traditional macro that takes arguments. These values were re-written as C enum's instead. The new function-line macros only expand when followed by an open parenthesis, and thus don't clash with enum use. Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-07-28 15:49:18 -07:00
Arnd Bergmann	123b158635	EDAC, i10nm: make skx_common.o a separate module Commit `598afa0504` ("kbuild: warn objects shared among multiple modules") was added to track down cases where the same object is linked into multiple modules. This can cause serious problems if some modules are builtin while others are not. That test triggers this warning: scripts/Makefile.build:236: drivers/edac/Makefile: skx_common.o is added to multiple modules: i10nm_edac skx_edac Make this a separate module instead. [Tony: Added more background details to commit message] Fixes: `d4dc89d069` ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20240529095132.1929397-1-arnd@kernel.org/	2024-05-29 13:30:10 -07:00
Qiuxu Zhuo	ba987eaaab	EDAC/i10nm: Add Intel Granite Rapids server support The Granite Rapids CPU model uses similar memory controller registers as Sapphire Rapids server but with some different configurations: - Various memory controller numbers for different Granite Rapids CPUs. So detect the number of present memory controllers at run time. - Different MMIO offsets of memory controllers. - Different triples of bus/dev/fun of some PCI devices used in i10nm_edac. Add above configurations and Granite Rapids CPU model ID for EDAC support. [Tony: Fixed 2 typos s/strcture/structure/] Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com	2023-01-25 08:17:30 -08:00
Qiuxu Zhuo	dd7814b785	EDAC/i10nm: Make more configurations CPU model specific The numbers of memory controllers per socket, channels per memory controller, DIMMs per channel and the triples of bus/device/function of PCI devices used in i10nm_edac can be CPU model specific. Add new fields to the structure res_config for above numbers and triples to make them CPU model specific. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com	2023-01-25 08:17:20 -08:00
Qiuxu Zhuo	6e8746cb73	EDAC/skx_common: Enable EDAC support for the "near" memory The current {skx,i10nm}_edac miss the EDAC support to decode errors from the 1st level memory (the fast "near" memory as cache) of the 2-level memory system. Introduce a helper function skx_error_in_mem() to check whether errors are from memory at the beginning of skx_mce_check_error(). As long as the errors are from memory (either the 1-level memory system or the 2-level memory system), decode the errors. Reported-and-tested-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com	2023-01-25 08:16:44 -08:00
Qiuxu Zhuo	d5f5e49953	EDAC/i10nm: Print an extra register set of retry_rd_err_log Sapphire Rapids server adds an extra register set for logging more retry_rd_err_log data. So add code to print the extra register set. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220722233338.341567-1-tony.luck@intel.com	2022-09-23 12:33:27 -07:00
Qiuxu Zhuo	acd4cf68fe	EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM An HBM memory channel is divided into two pseudo channels. Each pseudo channel has its own retry_rd_err_log registers. Retrieve and print retry_rd_err_log registers of the HBM pseudo channel if the memory error is from HBM. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220722233338.341567-1-tony.luck@intel.com	2022-09-23 12:33:18 -07:00
Qiuxu Zhuo	14646de48b	EDAC/skx_common: Add ChipSelect ADXL component Each pseudo channel of HBM has its own retry_rd_err_log registers. The bit 0 of ChipSelect ADXL component encodes the pseudo channel number of HBM memory. So add ChipSelect ADXL component to get HBM pseudo channel number. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220722233338.341567-1-tony.luck@intel.com	2022-09-23 12:33:08 -07:00
Youquan Song	2738c69a88	EDAC/i10nm: Add driver decoder for Ice Lake and Tremont CPUs Current i10nm_edac only supports firmware decoder (ACPI DSM methods). MCA bank registers of Ice Lake or Tremont CPUs contain the information to decode DDR memory errors. To get better decoding performance, add the driver decoder (decoding DDR memory errors via extracting error information from MCA bank registers) for Ice Lake and Tremont CPUs. Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220901194310.115427-1-tony.luck@intel.com/	2022-09-08 11:40:01 -07:00
Qiuxu Zhuo	fe32f36693	EDAC/skx_common: Use driver decoder first The performance of driver decoder[1] is better than the performance of firmware decoder[2], especially on frequent correctable errors. So use the driver decoder first, fall back to firmware decoder if the driver decoder is unavailable. Also rename the function pointer skx_decode to driver_decode (better name to contrast with adxl_decode). [1] Decode errors by extracting error information from registers of memory controllers and/or MCA bank registers. [2] Decode errors by calling ACPI DSM methods. Co-developed-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220901194310.115427-1-tony.luck@intel.com/	2022-09-08 11:40:00 -07:00
Youquan Song	cf4e6d52f5	EDAC/i10nm: Retrieve and print retry_rd_err_log registers Retrieve and print retry_rd_err_log registers like the earlier change: commit `e80634a75a` ("EDAC, skx: Retrieve and print retry_rd_err_log registers") This is a little trickier than on Skylake because of potential interference with BIOS use of the same registers. The default behavior is to ignore these registers. A module parameter retry_rd_err_log(default=0) controls the mode of operation: - 0=off : Default. - 1=bios : Linux doesn't reset any control bits, but just reports values. This is "no harm" mode, but it may miss reporting some data. - 2=linux: Linux tries to take control and resets mode bits, clears valid/UC bits after reading. This should be more reliable (especially if BIOS interference is reduced by disabling eMCA reporting mode in BIOS setup). Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210818175701.1611513-3-tony.luck@intel.com	2021-08-23 10:35:36 -07:00
Qiuxu Zhuo	c945088384	EDAC/i10nm: Add support for high bandwidth memory A future Xeon processor will include in-package HBM (high bandwidth memory). The in-package HBM memory controller shares the same architecture with the regular DDR memory controller. Add the HBM memory controller devices for EDAC support. Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210611170123.1057025-4-tony.luck@intel.com	2021-06-17 18:19:39 -07:00
Qiuxu Zhuo	4bd4d32e9a	EDAC/i10nm: Add detection of memory levels for ICX/SPR servers Current i10nm_edac driver is only for system configured in 1-level memory. If the system is configured in 2-level memory, the driver doesn't report the 1st level memory DIMM for the error address, even if the error occurs in the 1st level memory. Both Ice Lake servers and Sapphire Rapids servers can be configured in 2-level memory. Add detection of memory levels to i10nm_edac for the two kinds of servers so that the driver can report the 2nd level memory DIMM or the 1st level memory DIMM according to error source. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210611170123.1057025-3-tony.luck@intel.com	2021-06-17 18:19:30 -07:00
Qiuxu Zhuo	2f4348e5a8	EDAC/skx_common: Add new ADXL components for 2-level memory Some Intel servers may configure memory in 2 levels, using fast "near" memory (e.g. DDR) as a cache for larger, slower, "far" memory (e.g. 3D X-point). In these configurations the BIOS ADXL address translation for an address in a 2-level memory range will provide details of both the "near" and far components. Current exported ADXL components are only for 1-level memory system or for 2nd level memory of 2-level memory system. So add new ADXL components for 1st level memory of 2-level memory system to fully support 2-level memory system and the detection of memory error source(1st level memory or 2nd level memory). Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210611170123.1057025-2-tony.luck@intel.com	2021-06-17 18:19:22 -07:00
Qiuxu Zhuo	479f58dda2	EDAC/i10nm: Add Intel Sapphire Rapids server support The Sapphire Rapids CPU model shares the same memory controller architecture with Ice Lake server. There are some configurations different from Ice Lake server as below: - The device ID for configuration agent. - The size for per channel memory-mapped I/O. - The DDR5 memory support. So add the above configurations and the Sapphire Rapids CPU model ID for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2020-11-19 12:57:26 -08:00
Borislav Petkov	2a02ca0428	Merge branches 'edac-i10nm' and 'edac-misc' into edac-updates-for-5.8 Signed-off-by: Borislav Petkov <bp@suse.de>	2020-06-01 11:39:15 +02:00
Qiuxu Zhuo	1032095053	EDAC/skx: Use the mcmtr register to retrieve close_pg/bank_xor_enable The skx_edac driver wrongly uses the mtr register to retrieve two fields close_pg and bank_xor_enable. Fix it by using the correct mcmtr register to get the two fields. Cc: <stable@vger.kernel.org> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reported-by: Matthew Riley <mattdr@google.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com	2020-05-19 15:11:29 -07:00
Qiuxu Zhuo	ee5340abab	EDAC, {skx,i10nm}: Make some configurations CPU model specific The device ID for configuration agent PCI device and the offset for bus number configuration register can be CPU model specific. So add a new structure res_config to make them configurable and pass res_config to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic	2020-04-27 09:29:41 -07:00
Tony Luck	e80634a75a	EDAC, skx: Retrieve and print retry_rd_err_log registers Skylake logs some additional useful information in per-channel registers in addition the the architectural status/addr/misc logged in the machine check bank. Pick up this information and add it to the EDAC log: retry_rd_err_[five 32-bit register values] Sorry, no definitions for these registers. OEMs and DIMM vendors will be able to use them to isolate which cells in the DIMM are causing problems. correrrcnt[per rank corrected error counts] Note that if additional errors are logged while these registers are being read, you may see a jumble of values some from earlier errors, others from later errors (since the registers report the most recent logged error). The correrrcnt registers provide error counts per possible rank. If these counts only change by one since the previous error logged for this channel, then it is safe to assume that the registers logged provide a coherent view of one error. With this change EDAC logs look like this: EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000]) Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2019-10-18 15:27:58 -07:00
Qiuxu Zhuo	1dc78f1ffa	EDAC, skx, i10nm: Fix source ID register offset The source ID register offset for Skylake server is 0xf0, while for Icelake server is 0xf8. Pass the correct offset to get the source ID. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2019-06-26 10:07:27 -07:00
Qiuxu Zhuo	fe783516e3	EDAC, skx, i10nm: Make skx_common.c a pure library The following Kconfig constellations fail randconfig builds: CONFIG_ACPI_NFIT=y CONFIG_EDAC_DEBUG=y CONFIG_EDAC_SKX=m CONFIG_EDAC_I10NM=y or CONFIG_ACPI_NFIT=y CONFIG_EDAC_DEBUG=y CONFIG_EDAC_SKX=y CONFIG_EDAC_I10NM=m with: ... CC [M] drivers/edac/skx_common.o ... .../skx_common.o:.../skx_common.c:672: undefined reference to `__this_module' That is because if one of the two drivers - skx_edac or i10nm_edac - is built-in and the other one is a module, the shared file skx_common.c gets linked into a module object by kbuild. Therefore, when linking that same file into vmlinux, the '__this_module' symbol used in debugfs isn't defined, leading to the above error. Fix it by moving all debugfs code from skx_common.c to both skx_base.c and i10nm_base.c respectively. Thus, skx_common.c doesn't refer to the '__this_module' symbol anymore. Clarify skx_common.c's purpose at the top of the file for future reference, while at it. [ bp: Make text more readable. ] Fixes: `d4dc89d069` ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: https://lkml.kernel.org/r/20190321221339.GA32323@agluck-desk	2019-03-23 09:43:50 +01:00
Qiuxu Zhuo	88a242c987	EDAC, skx_common: Separate common code out from skx_edac Parts of skx_edac can be shared with the Intel 10nm server EDAC driver. Carve out the common parts from skx_edac in preparation to support both skx_edac driver and i10nm_edac drivers. Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: https://lkml.kernel.org/r/20190130191519.15393-3-tony.luck@intel.com	2019-02-02 10:50:59 +01:00

33 commits