linux/arch/x86/include/asm/topology.h

319 lines
8.7 KiB
C
Raw Permalink Normal View History

/*
* Written by: Matthew Dobson, IBM Corporation
*
* Copyright (C) 2002, IBM Corp.
*
* All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
* NON INFRINGEMENT. See the GNU General Public License for more
* details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*
* Send feedback to <colpatch@us.ibm.com>
*/
#ifndef _ASM_X86_TOPOLOGY_H
#define _ASM_X86_TOPOLOGY_H
/*
* to preserve the visibility of NUMA_NO_NODE definition,
* moved to there from here. May be used independent of
* CONFIG_NUMA.
*/
#include <linux/numa.h>
#include <linux/cpumask.h>
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:12 +02:00
#ifdef CONFIG_NUMA
#include <asm/mpspec.h>
#include <asm/percpu.h>
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:12 +02:00
/* Mappings between logical cpu number and node number */
DECLARE_EARLY_PER_CPU(int, x86_cpu_to_node_map);
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
/*
* override generic percpu implementation of cpu_to_node
*/
extern int __cpu_to_node(int cpu);
#define cpu_to_node __cpu_to_node
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:12 +02:00
extern int early_cpu_to_node(int cpu);
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:12 +02:00
#else /* !CONFIG_DEBUG_PER_CPU_MAPS */
/* Same function but used if called before per_cpu areas are setup */
static inline int early_cpu_to_node(int cpu)
{
return early_per_cpu(x86_cpu_to_node_map, cpu);
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:12 +02:00
}
#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
/* Mappings between node number and cpus on that node. */
extern cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
extern const struct cpumask *cpumask_of_node(int node);
#else
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:12 +02:00
/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
static inline const struct cpumask *cpumask_of_node(int node)
{
return node_to_cpumask_map[node];
}
#endif
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:12 +02:00
extern void setup_node_to_cpumask_map(void);
#define pcibus_to_node(bus) __pcibus_to_node(bus)
extern int __node_distance(int, int);
#define node_distance(a, b) __node_distance(a, b)
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:12 +02:00
#else /* !CONFIG_NUMA */
static inline int numa_node_id(void)
{
return 0;
}
numa: add generic percpu var numa_node_id() implementation Rework the generic version of the numa_node_id() function to use the new generic percpu variable infrastructure. Guard the new implementation with a new config option: CONFIG_USE_PERCPU_NUMA_NODE_ID. Archs which support this new implemention will default this option to 'y' when NUMA is configured. This config option could be removed if/when all archs switch over to the generic percpu implementation of numa_node_id(). Arch support involves: 1) converting any existing per cpu variable implementations to use this implementation. x86_64 is an instance of such an arch. 2) archs that don't use a per cpu variable for numa_node_id() will need to initialize the new per cpu variable "numa_node" as cpus are brought on-line. ia64 is an example. 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., when NUMA is configured. This is required because I have retained the old implementation by default to allow archs to be modified incrementally, as desired. Subsequent patches will convert x86_64 and ia64 to use this implemenation. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-26 14:44:56 -07:00
/*
* indicate override:
*/
#define numa_node_id numa_node_id
static inline int early_cpu_to_node(int cpu)
{
return 0;
}
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:12 +02:00
static inline void setup_node_to_cpumask_map(void) { }
#endif
#include <asm-generic/topology.h>
x86/cpu: Provide cpu_init/parse_topology() Topology evaluation is a complete disaster and impenetrable mess. It's scattered all over the place with some vendor implementations doing early evaluation and some not. The most horrific part is the permanent overwriting of smt_max_siblings and __max_die_per_package, instead of establishing them once on the boot CPU and validating the result on the APs. The goals are: - One topology evaluation entry point - Proper sharing of pointlessly duplicated code - Proper structuring of the evaluation logic and preferences. - Evaluating important system wide information only once on the boot CPU - Making the 0xb/0x1f leaf parsing less convoluted and actually fixing the short comings of leaf 0x1f evaluation. Start to consolidate the topology evaluation code by providing the entry points for the early boot CPU evaluation and for the final parsing on the boot CPU and the APs. Move the trivial pieces into that new code: - The initialization of cpuinfo_x86::topo - The evaluation of CPUID leaf 1, which presets topo::initial_apicid - topo_apicid is set to topo::initial_apicid when invoked from early boot. When invoked for the final evaluation on the boot CPU it reads the actual APIC ID, which makes apic_get_initial_apicid() obsolete once everything is converted over. Provide a temporary helper function topo_converted() which shields off the not yet converted CPU vendors from invoking code which would break them. This shielding covers all vendor CPUs which support SMP, but not the historical pure UP ones as they only need the topology info init and eventually the initial APIC initialization. Provide two new members in cpuinfo_x86::topo to store the maximum number of SMT siblings and the number of dies per package and add them to the debugfs readout. These two members will be used to populate this information on the boot CPU and to validate the APs against it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Wang Wendy <wendy.wang@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240212153624.581436579@linutronix.de
2024-02-13 22:04:01 +01:00
/* Topology information */
enum x86_topology_domains {
TOPO_SMT_DOMAIN,
TOPO_CORE_DOMAIN,
TOPO_MODULE_DOMAIN,
TOPO_TILE_DOMAIN,
TOPO_DIE_DOMAIN,
TOPO_DIEGRP_DOMAIN,
TOPO_PKG_DOMAIN,
TOPO_MAX_DOMAIN,
};
enum x86_topology_cpu_type {
TOPO_CPU_TYPE_PERFORMANCE,
TOPO_CPU_TYPE_EFFICIENCY,
TOPO_CPU_TYPE_UNKNOWN,
};
x86/cpu: Provide cpu_init/parse_topology() Topology evaluation is a complete disaster and impenetrable mess. It's scattered all over the place with some vendor implementations doing early evaluation and some not. The most horrific part is the permanent overwriting of smt_max_siblings and __max_die_per_package, instead of establishing them once on the boot CPU and validating the result on the APs. The goals are: - One topology evaluation entry point - Proper sharing of pointlessly duplicated code - Proper structuring of the evaluation logic and preferences. - Evaluating important system wide information only once on the boot CPU - Making the 0xb/0x1f leaf parsing less convoluted and actually fixing the short comings of leaf 0x1f evaluation. Start to consolidate the topology evaluation code by providing the entry points for the early boot CPU evaluation and for the final parsing on the boot CPU and the APs. Move the trivial pieces into that new code: - The initialization of cpuinfo_x86::topo - The evaluation of CPUID leaf 1, which presets topo::initial_apicid - topo_apicid is set to topo::initial_apicid when invoked from early boot. When invoked for the final evaluation on the boot CPU it reads the actual APIC ID, which makes apic_get_initial_apicid() obsolete once everything is converted over. Provide a temporary helper function topo_converted() which shields off the not yet converted CPU vendors from invoking code which would break them. This shielding covers all vendor CPUs which support SMP, but not the historical pure UP ones as they only need the topology info init and eventually the initial APIC initialization. Provide two new members in cpuinfo_x86::topo to store the maximum number of SMT siblings and the number of dies per package and add them to the debugfs readout. These two members will be used to populate this information on the boot CPU and to validate the APs against it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Wang Wendy <wendy.wang@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240212153624.581436579@linutronix.de
2024-02-13 22:04:01 +01:00
struct x86_topology_system {
unsigned int dom_shifts[TOPO_MAX_DOMAIN];
unsigned int dom_size[TOPO_MAX_DOMAIN];
};
extern struct x86_topology_system x86_topo_system;
static inline unsigned int topology_get_domain_size(enum x86_topology_domains dom)
{
return x86_topo_system.dom_size[dom];
}
static inline unsigned int topology_get_domain_shift(enum x86_topology_domains dom)
{
return dom == TOPO_SMT_DOMAIN ? 0 : x86_topo_system.dom_shifts[dom - 1];
}
extern const struct cpumask *cpu_coregroup_mask(int cpu);
extern const struct cpumask *cpu_clustergroup_mask(int cpu);
#define topology_logical_package_id(cpu) (cpu_data(cpu).topo.logical_pkg_id)
#define topology_physical_package_id(cpu) (cpu_data(cpu).topo.pkg_id)
#define topology_logical_die_id(cpu) (cpu_data(cpu).topo.logical_die_id)
#define topology_logical_core_id(cpu) (cpu_data(cpu).topo.logical_core_id)
#define topology_die_id(cpu) (cpu_data(cpu).topo.die_id)
#define topology_core_id(cpu) (cpu_data(cpu).topo.core_id)
#define topology_ppin(cpu) (cpu_data(cpu).ppin)
#define topology_amd_node_id(cpu) (cpu_data(cpu).topo.amd_node_id)
extern unsigned int __max_dies_per_package;
extern unsigned int __max_logical_packages;
extern unsigned int __max_threads_per_core;
extern unsigned int __num_threads_per_package;
extern unsigned int __num_cores_per_package;
const char *get_topology_cpu_type_name(struct cpuinfo_x86 *c);
enum x86_topology_cpu_type get_topology_cpu_type(struct cpuinfo_x86 *c);
static inline unsigned int topology_max_packages(void)
{
return __max_logical_packages;
}
static inline unsigned int topology_max_dies_per_package(void)
{
return __max_dies_per_package;
}
static inline unsigned int topology_num_cores_per_package(void)
{
return __num_cores_per_package;
}
static inline unsigned int topology_num_threads_per_package(void)
{
return __num_threads_per_package;
}
#ifdef CONFIG_X86_LOCAL_APIC
int topology_get_logical_id(u32 apicid, enum x86_topology_domains at_level);
#else
static inline int topology_get_logical_id(u32 apicid, enum x86_topology_domains at_level)
{
return 0;
}
#endif
#ifdef CONFIG_SMP
#define topology_cluster_id(cpu) (cpu_data(cpu).topo.l2c_id)
#define topology_die_cpumask(cpu) (per_cpu(cpu_die_map, cpu))
#define topology_cluster_cpumask(cpu) (cpu_clustergroup_mask(cpu))
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
x86/topology: Create logical package id For per package oriented services we must be able to rely on the number of CPU packages to be within bounds. Create a tracking facility, which - calculates the number of possible packages depending on nr_cpu_ids after boot - makes sure that the package id is within the number of possible packages. If the apic id is outside we map it to a logical package id if there is enough space available. Provide interfaces for drivers to query the mapping and do translations from physcial to logical ids. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Harish Chegondi <harish.chegondi@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20160222221011.541071755@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-22 22:19:15 +00:00
static inline int topology_phys_to_logical_pkg(unsigned int pkg)
{
return topology_get_logical_id(pkg << x86_topo_system.dom_shifts[TOPO_PKG_DOMAIN],
TOPO_PKG_DOMAIN);
}
extern int __max_smt_threads;
static inline int topology_max_smt_threads(void)
{
return __max_smt_threads;
}
#include <linux/cpu_smt.h>
extern unsigned int __amd_nodes_per_pkg;
static inline unsigned int topology_amd_nodes_per_pkg(void)
{
return __amd_nodes_per_pkg;
}
extern struct cpumask __cpu_primary_thread_mask;
#define cpu_primary_thread_mask ((const struct cpumask *)&__cpu_primary_thread_mask)
/**
* topology_is_primary_thread - Check whether CPU is the primary SMT thread
* @cpu: CPU to check
*/
static inline bool topology_is_primary_thread(unsigned int cpu)
{
return cpumask_test_cpu(cpu, cpu_primary_thread_mask);
}
#define topology_is_primary_thread topology_is_primary_thread
#else /* CONFIG_SMP */
x86/topology: Create logical package id For per package oriented services we must be able to rely on the number of CPU packages to be within bounds. Create a tracking facility, which - calculates the number of possible packages depending on nr_cpu_ids after boot - makes sure that the package id is within the number of possible packages. If the apic id is outside we map it to a logical package id if there is enough space available. Provide interfaces for drivers to query the mapping and do translations from physcial to logical ids. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Harish Chegondi <harish.chegondi@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20160222221011.541071755@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-22 22:19:15 +00:00
static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
static inline int topology_max_smt_threads(void) { return 1; }
- Add a FRU (Field Replaceable Unit) memory poison manager which collects and manages previously encountered hw errors in order to save them to persistent storage across reboots. Previously recorded errors are "replayed" upon reboot in order to poison memory which has caused said errors in the past. The main use case is stacked, on-chip memory which cannot simply be replaced so poisoning faulty areas of it and thus making them inaccessible is the only strategy to prolong its lifetime. - Add an AMD address translation library glue which converts the reported addresses of hw errors into system physical addresses in order to be used by other subsystems like memory failure, for example. Add support for MI300 accelerators to that library. - igen6: Add support for Alder Lake-N SoC - i10nm: Add Grand Ridge support - The usual fixlets and cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXvKHcACgkQEsHwGGHe VUo4Lg/+OwXDI1EaCDyaHJ+f6JRmNok1EGjKMVjpp71/XmE3eUjiXfCv/b0bwl3V oIXGlXpJ5RSME+9aFDWADaE3h5zAGzTwQXuKtOUQPiJ6UuCebXodm8SaIG8V8trG yaW/hhP98AoJD+fN6qzv4XWYvTG8VRQs4tdISg9FXiljTjv4mKA+sxuCu8KpfrDh Tg+9F4Rre6gyR5GaB6N7Cc0k97DM7n5yKBZZGKucv+oYzDyf6n631ZSJ2zA9NC51 CJlux917hCXI/IWrCQ2nkyfPPXxn8AaznUAA30wKgwlt8TFSdKTW+DvRA2zyuAU3 0UDHO4FezOKuzVnWkzdnKsIMAnDyTGOz3Fi2LU4mC+JHaHHmI2quSWDxp5phWBuy S+T3XHxpbSsLGEI7zxT5F9u1oAlCvYu1C7HJw+yxNSn2iCy5LoNo0H/kl/nhR8Xr FgVp8SYgQRU2Pp8vgGOibMYY/TAHX55EticKdxvBI0yY+iqoJyAbZ0fb0XyLNc7s GqoWfvrK1KQzf5/Ya1Mm//0/QTPyFmJwujMJ2eEnMRRER+23bYpGvVBBT8E1sG9s gqEJkKjmVCPt9xJTcivm96sLJ7CG36w8+r/axSqpKXdcvDG9ec8G8PRqjlo5pcvh gYevmCBIcKny1xuhALwD6Rn2mkPip7araycDx9X9nd5z1qCxBaU= =FR2l -----END PGP SIGNATURE----- Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add a FRU (Field Replaceable Unit) memory poison manager which collects and manages previously encountered hw errors in order to save them to persistent storage across reboots. Previously recorded errors are "replayed" upon reboot in order to poison memory which has caused said errors in the past. The main use case is stacked, on-chip memory which cannot simply be replaced so poisoning faulty areas of it and thus making them inaccessible is the only strategy to prolong its lifetime. - Add an AMD address translation library glue which converts the reported addresses of hw errors into system physical addresses in order to be used by other subsystems like memory failure, for example. Add support for MI300 accelerators to that library. - igen6: Add support for Alder Lake-N SoC - i10nm: Add Grand Ridge support - The usual fixlets and cleanups * tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/versal: Convert to platform remove callback returning void RAS/AMD/FMPM: Fix off by one when unwinding on error RAS/AMD/FMPM: Add debugfs interface to print record entries RAS/AMD/FMPM: Save SPA values RAS: Export helper to get ras_debugfs_dir RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2() RAS: Introduce a FRU memory poison manager RAS/AMD/ATL: Add MI300 row retirement support Documentation: Move RAS section to admin-guide EDAC/versal: Make the bit position of injected errors configurable EDAC/i10nm: Add Intel Grand Ridge micro-server support EDAC/igen6: Add one more Intel Alder Lake-N SoC support RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300() RAS/AMD/ATL: Add MI300 support Documentation: RAS: Add index and address translation section EDAC/amd64: Use new AMD Address Translation Library RAS: Introduce AMD Address Translation Library EDAC/synopsys: Convert to devm_platform_ioremap_resource()
2024-03-11 18:14:06 -07:00
static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
#endif /* !CONFIG_SMP */
static inline void arch_fix_phys_package_id(int num, u32 slot)
{
}
struct pci_bus;
int x86_pci_root_bus_node(int bus);
void x86_pci_root_bus_resources(int bus, struct list_head *resources);
extern bool x86_topology_update;
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 12:23:55 -08:00
#ifdef CONFIG_SCHED_MC_PRIO
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 12:23:55 -08:00
#include <asm/percpu.h>
DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
extern bool __read_mostly sysctl_sched_itmt_enabled;
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 12:23:55 -08:00
/* Interface to set priority of a cpu */
void sched_set_itmt_core_prio(int prio, int core_cpu);
/* Interface to notify scheduler that system supports ITMT */
int sched_set_itmt_support(void);
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 12:23:55 -08:00
/* Interface to notify scheduler that system revokes ITMT support */
void sched_clear_itmt_support(void);
#else /* CONFIG_SCHED_MC_PRIO */
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 12:23:55 -08:00
#define sysctl_sched_itmt_enabled false
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 12:23:55 -08:00
static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
{
}
static inline int sched_set_itmt_support(void)
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 12:23:55 -08:00
{
return 0;
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 12:23:55 -08:00
}
static inline void sched_clear_itmt_support(void)
{
}
#endif /* CONFIG_SCHED_MC_PRIO */
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 12:23:55 -08:00
x86, sched: check for counters overflow in frequency invariant accounting The product mcnt * arch_max_freq_ratio can overflows u64. For context, a large value for arch_max_freq_ratio would be 5000, corresponding to a turbo_freq/base_freq ratio of 5 (normally it's more like 1500-2000). A large increment frequency for the MPERF counter would be 5GHz (the base clock of all CPUs on the market today is less than that). With these figures, a CPU would need to go without a scheduler tick for around 8 days for the u64 overflow to happen. It is unlikely, but the check is warranted. Under similar conditions, the difference acnt of two consecutive APERF readings can overflow as well. In these circumstances is appropriate to disable frequency invariant accounting: the feature relies on measures of the clock frequency done at every scheduler tick, which need to be "fresh" to be at all meaningful. A note on i386: prior to version 5.1, the GCC compiler didn't have the builtin function __builtin_mul_overflow. In these GCC versions the macro check_mul_overflow needs __udivdi3() to do (u64)a/b, which the kernel doesn't provide. For this reason this change fails to build on i386 if GCC<5.1, and we protect the entire frequency invariant code behind CONFIG_X86_64 (special thanks to "kbuild test robot" <lkp@intel.com>). Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance") Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20200531182453.15254-2-ggherdovich@suse.cz
2020-05-31 20:24:51 +02:00
#if defined(CONFIG_SMP) && defined(CONFIG_X86_64)
x86, sched: Add support for frequency invariance Implement arch_scale_freq_capacity() for 'modern' x86. This function is used by the scheduler to correctly account usage in the face of DVFS. The present patch addresses Intel processors specifically and has positive performance and performance-per-watt implications for the schedutil cpufreq governor, bringing it closer to, if not on-par with, the powersave governor from the intel_pstate driver/framework. Large performance gains are obtained when the machine is lightly loaded and no regression are observed at saturation. The benchmarks with the largest gains are kernel compilation, tbench (the networking version of dbench) and shell-intensive workloads. 1. FREQUENCY INVARIANCE: MOTIVATION * Without it, a task looks larger if the CPU runs slower 2. PECULIARITIES OF X86 * freq invariance accounting requires knowing the ratio freq_curr/freq_max 2.1 CURRENT FREQUENCY * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz") 2.2 MAX FREQUENCY * It varies with time (turbo). As an approximation, we set it to a constant, i.e. 4-cores turbo frequency. 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR * The invariant schedutil's formula has no feedback loop and reacts faster to utilization changes 4. KNOWN LIMITATIONS * In some cases tasks can't reach max util despite how hard they try 5. PERFORMANCE TESTING 5.1 MACHINES * Skylake, Broadwell, Haswell 5.2 SETUP * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12 active cores turbo w/ invariant schedutil, and intel_pstate/powersave 5.3 BENCHMARK RESULTS 5.3.1 NEUTRAL BENCHMARKS * NAS Parallel Benchmark (HPC), hackbench 5.3.2 NON-NEUTRAL BENCHMARKS * tbench (10-30% better), kernbench (10-15% better), shell-intensive-scripts (30-50% better) * no regressions 5.3.3 SELECTION OF DETAILED RESULTS 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT * dbench (5% worse on one machine), kernbench (3% worse), tbench (5-10% better), shell-intensive-scripts (10-40% better) 6. MICROARCH'ES ADDRESSED HERE * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum etc have different MSRs semantic for querying turbo levels) 7. REFERENCES * MMTests performance testing framework, github.com/gormanm/mmtests +-------------------------------------------------------------------------+ | 1. FREQUENCY INVARIANCE: MOTIVATION +-------------------------------------------------------------------------+ For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When running a task that would consume 1/3rd of a CPU at 1000 MHz, it would appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the false impression this CPU is almost at capacity, even though it can go faster [*]. In a nutshell, without frequency scale-invariance tasks look larger just because the CPU is running slower. [*] (footnote: this assumes a linear frequency/performance relation; which everybody knows to be false, but given realities its the best approximation we can make.) +-------------------------------------------------------------------------+ | 2. PECULIARITIES OF X86 +-------------------------------------------------------------------------+ Accounting for frequency changes in PELT signals requires the computation of the ratio freq_curr / freq_max. On x86 neither of those terms is readily available. 2.1 CURRENT FREQUENCY ==================== Since modern x86 has hardware control over the actual frequency we run at (because amongst other things, Turbo-Mode), we cannot simply use the frequency as requested through cpufreq. Instead we use the APERF/MPERF MSRs to compute the effective frequency over the recent past. Also, because reading MSRs is expensive, don't do so every time we need the value, but amortize the cost by doing it every tick. 2.2 MAX FREQUENCY ================= Obtaining freq_max is also non-trivial because at any time the hardware can provide a frequency boost to a selected subset of cores if the package has enough power to spare (eg: Turbo Boost). This means that the maximum frequency available to a given core changes with time. The approach taken in this change is to arbitrarily set freq_max to a constant value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most microarchitectures, after evaluating the following candidates: * 1-core (1C) turbo frequency (the fastest turbo state available) * around base frequency (a.k.a. max P-state) * something in between, such as 4C turbo To interpret these options, consider that this is the denominator in freq_curr/freq_max, and that ratio will be used to scale PELT signals such as util_avg and load_avg. A large denominator will undershoot (util_avg looks a bit smaller than it really is), viceversa with a smaller denominator PELT signals will tend to overshoot. Given that PELT drives frequency selection in the schedutil governor, we will have: freq_max set to | effect on DVFS --------------------+------------------ 1C turbo | power efficiency (lower freq choices) base freq | performance (higher util_avg, higher freq requests) 4C turbo | a bit of both 4C turbo proves to be a good compromise in a number of benchmarks (see below). +-------------------------------------------------------------------------+ | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR +-------------------------------------------------------------------------+ Once an architecture implements a frequency scale-invariant utilization (the PELT signal util_avg), schedutil switches its frequency selection formula from freq_next = 1.25 * freq_curr * util [non-invariant util signal] to freq_next = 1.25 * freq_max * util [invariant util signal] where, in the second formula, freq_max is set to the 1C turbo frequency (max turbo). The advantage of the second formula, whose usage we unlock with this patch, is that freq_next doesn't depend on the current frequency in an iterative fashion, but can jump to any frequency in a single update. This absence of feedback in the formula makes it quicker to react to utilization changes and more robust against pathological instabilities. Compare it to the update formula of intel_pstate/powersave: freq_next = 1.25 * freq_max * Busy% where again freq_max is 1C turbo and Busy% is the percentage of time not spent idling (calculated with delta_MPERF / delta_TSC); essentially the same as invariant schedutil, and largely responsible for intel_pstate/powersave good reputation. The non-invariant schedutil formula is derived from the invariant one by approximating util_inv with util_raw * freq_curr / freq_max, but this has limitations. Testing shows improved performances due to better frequency selections when the machine is lightly loaded, and essentially no change in behaviour at saturation / overutilization. +-------------------------------------------------------------------------+ | 4. KNOWN LIMITATIONS +-------------------------------------------------------------------------+ It's been shown that it is possible to create pathological scenarios where a CPU-bound task cannot reach max utilization, if the normalizing factor freq_max is fixed to a constant value (see [Lelli-2018]). If freq_max is set to 4C turbo as we do here, one needs to peg at least 5 cores in a package doing some busywork, and observe that none of those task will ever reach max util (1024) because they're all running at less than the 4C turbo frequency. While this concern still applies, we believe the performance benefit of frequency scale-invariant PELT signals outweights the cost of this limitation. [Lelli-2018] https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/ +-------------------------------------------------------------------------+ | 5. PERFORMANCE TESTING +-------------------------------------------------------------------------+ 5.1 MACHINES ============ We tested the patch on three machines, with Skylake, Broadwell and Haswell CPUs. The details are below, together with the available turbo ratios as reported by the appropriate MSRs. * 8x-SKYLAKE-UMA: Single socket E3-1240 v5, Skylake 4 cores/8 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 800 |******** BASE 3500 |*********************************** 4C 3700 |************************************* 3C 3800 |************************************** 2C 3900 |*************************************** 1C 3900 |*************************************** * 80x-BROADWELL-NUMA: Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2200 |********************** 8C 2900 |***************************** 7C 3000 |****************************** 6C 3100 |******************************* 5C 3200 |******************************** 4C 3300 |********************************* 3C 3400 |********************************** 2C 3600 |************************************ 1C 3600 |************************************ * 48x-HASWELL-NUMA Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2300 |*********************** 12C 2600 |************************** 11C 2600 |************************** 10C 2600 |************************** 9C 2600 |************************** 8C 2600 |************************** 7C 2600 |************************** 6C 2600 |************************** 5C 2700 |*************************** 4C 2800 |**************************** 3C 2900 |***************************** 2C 3100 |******************************* 1C 3100 |******************************* 5.2 SETUP ========= * The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate driver in passive mode. * The rationale for choosing the various freq_max values to test have been to try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical on all machines), plus one more value closer to base_freq but still in the turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA). * In addition we've run all tests with intel_pstate/powersave for comparison. * The filesystem is always XFS, the userspace is openSUSE Leap 15.1. * 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs with active intel_pstate on this machine use that. This gives, in terms of combinations tested on each machine: * 8x-SKYLAKE-UMA * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive * intel_pstate active + powersave + HWP * invariant schedutil, freq_max = 1C turbo * invariant schedutil, freq_max = 3C turbo * invariant schedutil, freq_max = 4C turbo * both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA * [same as 8x-SKYLAKE-UMA, but no HWP capable] * invariant schedutil, freq_max = 8C turbo (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo") 5.3 BENCHMARK RESULTS ===================== 5.3.1 NEUTRAL BENCHMARKS ------------------------ Tests that didn't show any measurable difference in performance on any of the test machines between non-invariant schedutil and our patch are: * NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any computational kernel * flexible I/O (FIO) * hackbench (using threads or processes, and using pipes or sockets) 5.3.2 NON-NEUTRAL BENCHMARKS ---------------------------- What follow are summary tables where each benchmark result is given a score. * A tilde (~) means a neutral result, i.e. no difference from baseline. * Scores are computed with the ratio result_new / result_baseline, so a tilde means a score of 1.00. * The results in the score ratio are the geometric means of results running the benchmark with different parameters (eg: for kernbench: using 1, 2, 4, ... number of processes; for pgbench: varying the number of clients, and so on). * The first three tables show higher-is-better kind of tests (i.e. measured in operations/second), the subsequent three show lower-is-better kind of tests (i.e. the workload is fixed and we measure elapsed time, think kernbench). * "gitsource" is a name we made up for the test consisting in running the entire unit tests suite of the Git SCM and measuring how long it takes. We take it as a typical example of shell-intensive serialized workload. * In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other columns show invariant schedutil for different values of freq_max. 4C turbo is circled as it's the value we've chosen for the final implementation. 80x-BROADWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.14 ~ ~ | 1.11 | 1.14 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.06 ~ 1.06 | 1.05 | 1.07 netperf-tcp ~ 1.03 ~ | 1.01 | 1.02 tbench4 1.57 1.18 1.22 | 1.30 | 1.56 +------+ 8x-SKYLAKE-UMA (comparison ratio; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.30 1.14 1.14 | 1.16 | +------+ 48x-HASWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.15 ~ ~ | 1.06 | 1.16 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.05 0.97 1.04 | 1.04 | 1.02 netperf-tcp 0.96 1.01 1.01 | 1.01 | 1.01 tbench4 1.50 1.05 1.13 | 1.13 | 1.25 +------+ In the table above we see that active intel_pstate is slightly better than our 4C-turbo patch (both in reference to the baseline non-invariant schedutil) on read-only pgbench and much better on tbench. Both cases are notable in which it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant schedutil to get closer. If we ignore active intel_pstate and focus on the comparison with baseline alone, there are several instances of double-digit performance improvement. 80x-BROADWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 1.23 0.95 0.95 | 0.95 | 0.95 kernbench 0.93 0.83 0.83 | 0.83 | 0.82 gitsource 0.98 0.49 0.49 | 0.49 | 0.48 +------+ 8x-SKYLAKE-UMA (comparison ratio; lower is better) +------+ I_PSTATE/HWP 1C 3C | 4C | dbench4 ~ ~ ~ | ~ | kernbench ~ ~ ~ | ~ | gitsource 0.92 0.55 0.55 | 0.55 | +------+ 48x-HASWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 ~ ~ ~ | ~ | ~ kernbench 0.94 0.90 0.89 | 0.90 | 0.90 gitsource 0.97 0.69 0.69 | 0.69 | 0.69 +------+ dbench is not very remarkable here, unless we notice how poorly active intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus non-invariant schedutil. We repeated that run getting consistent results. Out of scope for the patch at hand, but deserving future investigation. Other than that, we previously ran this campaign with Linux v5.0 and saw the patch doing better on dbench a the time. We haven't checked closely and can only speculate at this point. On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in the detailed tables that the gains concentrate on low process counts (lightly loaded machines). The test we call "gitsource" (running the git unit test suite, a long-running single-threaded shell script) appears rather spectacular in this table (gains of 30-50% depending on the machine). It is to be noted, however, that gitsource has no adjustable parameters (such as the number of jobs in kernbench, which we average over in order to get a single-number summary score) and is exactly the kind of low-parallelism workload that benefits the most from this patch. When looking at the detailed tables of kernbench or tbench4, at low process or client counts one can see similar numbers. 5.3.3 SELECTION OF DETAILED RESULTS ----------------------------------- Machine : 48x-HASWELL-NUMA Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback) Varying parameter : number of clients Unit : MB/sec (higher is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 126.73 +- 0.31% ( ) 315.91 +- 0.66% ( 149.28%) 125.03 +- 0.76% ( -1.34%) Hmean 2 258.04 +- 0.62% ( ) 614.16 +- 0.51% ( 138.01%) 269.58 +- 1.45% ( 4.47%) Hmean 4 514.30 +- 0.67% ( ) 1146.58 +- 0.54% ( 122.94%) 533.84 +- 1.99% ( 3.80%) Hmean 8 1111.38 +- 2.52% ( ) 2159.78 +- 0.38% ( 94.33%) 1359.92 +- 1.56% ( 22.36%) Hmean 16 2286.47 +- 1.36% ( ) 3338.29 +- 0.21% ( 46.00%) 2720.20 +- 0.52% ( 18.97%) Hmean 32 4704.84 +- 0.35% ( ) 4759.03 +- 0.43% ( 1.15%) 4774.48 +- 0.30% ( 1.48%) Hmean 64 7578.04 +- 0.27% ( ) 7533.70 +- 0.43% ( -0.59%) 7462.17 +- 0.65% ( -1.53%) Hmean 128 6998.52 +- 0.16% ( ) 6987.59 +- 0.12% ( -0.16%) 6909.17 +- 0.14% ( -1.28%) Hmean 192 6901.35 +- 0.25% ( ) 6913.16 +- 0.10% ( 0.17%) 6855.47 +- 0.21% ( -0.66%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 12C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 128.43 +- 0.28% ( 1.34%) 130.64 +- 3.81% ( 3.09%) 153.71 +- 5.89% ( 21.30%) Hmean 2 311.70 +- 6.15% ( 20.79%) 281.66 +- 3.40% ( 9.15%) 305.08 +- 5.70% ( 18.23%) Hmean 4 641.98 +- 2.32% ( 24.83%) 623.88 +- 5.28% ( 21.31%) 906.84 +- 4.65% ( 76.32%) Hmean 8 1633.31 +- 1.56% ( 46.96%) 1714.16 +- 0.93% ( 54.24%) 2095.74 +- 0.47% ( 88.57%) Hmean 16 3047.24 +- 0.42% ( 33.27%) 3155.02 +- 0.30% ( 37.99%) 3634.58 +- 0.15% ( 58.96%) Hmean 32 4734.31 +- 0.60% ( 0.63%) 4804.38 +- 0.23% ( 2.12%) 4674.62 +- 0.27% ( -0.64%) Hmean 64 7699.74 +- 0.35% ( 1.61%) 7499.72 +- 0.34% ( -1.03%) 7659.03 +- 0.25% ( 1.07%) Hmean 128 6935.18 +- 0.15% ( -0.91%) 6942.54 +- 0.10% ( -0.80%) 7004.85 +- 0.12% ( 0.09%) Hmean 192 6901.62 +- 0.12% ( 0.00%) 6856.93 +- 0.10% ( -0.64%) 6978.74 +- 0.10% ( 1.12%) This is one of the cases where the patch still can't surpass active intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are visible up to 16 clients and the saturated scenario is the same as baseline. The scores in the summary table from the previous sections are ratios of geometric means of the results over different clients, as seen in this table. Machine : 80x-BROADWELL-NUMA Benchmark : kernbench (kernel compilation) Varying parameter : number of jobs Unit : seconds (lower is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 379.68 +- 0.06% ( ) 330.20 +- 0.43% ( 13.03%) 285.93 +- 0.07% ( 24.69%) Amean 4 200.15 +- 0.24% ( ) 175.89 +- 0.22% ( 12.12%) 153.78 +- 0.25% ( 23.17%) Amean 8 106.20 +- 0.31% ( ) 95.54 +- 0.23% ( 10.03%) 86.74 +- 0.10% ( 18.32%) Amean 16 56.96 +- 1.31% ( ) 53.25 +- 1.22% ( 6.50%) 48.34 +- 1.73% ( 15.13%) Amean 32 34.80 +- 2.46% ( ) 33.81 +- 0.77% ( 2.83%) 30.28 +- 1.59% ( 12.99%) Amean 64 26.11 +- 1.63% ( ) 25.04 +- 1.07% ( 4.10%) 22.41 +- 2.37% ( 14.16%) Amean 128 24.80 +- 1.36% ( ) 23.57 +- 1.23% ( 4.93%) 21.44 +- 1.37% ( 13.55%) Amean 160 24.85 +- 0.56% ( ) 23.85 +- 1.17% ( 4.06%) 21.25 +- 1.12% ( 14.49%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 8C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 284.08 +- 0.13% ( 25.18%) 283.96 +- 0.51% ( 25.21%) 285.05 +- 0.21% ( 24.92%) Amean 4 153.18 +- 0.22% ( 23.47%) 154.70 +- 1.64% ( 22.71%) 153.64 +- 0.30% ( 23.24%) Amean 8 87.06 +- 0.28% ( 18.02%) 86.77 +- 0.46% ( 18.29%) 86.78 +- 0.22% ( 18.28%) Amean 16 48.03 +- 0.93% ( 15.68%) 47.75 +- 1.99% ( 16.17%) 47.52 +- 1.61% ( 16.57%) Amean 32 30.23 +- 1.20% ( 13.14%) 30.08 +- 1.67% ( 13.57%) 30.07 +- 1.67% ( 13.60%) Amean 64 22.59 +- 2.02% ( 13.50%) 22.63 +- 0.81% ( 13.32%) 22.42 +- 0.76% ( 14.12%) Amean 128 21.37 +- 0.67% ( 13.82%) 21.31 +- 1.15% ( 14.07%) 21.17 +- 1.93% ( 14.63%) Amean 160 21.68 +- 0.57% ( 12.76%) 21.18 +- 1.74% ( 14.77%) 21.22 +- 1.00% ( 14.61%) The patch outperform active intel_pstate (and baseline) by a considerable margin; the summary table from the previous section says 4C turbo and active intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is 0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no noticeable difference with regard to the value of freq_max. Machine : 8x-SKYLAKE-UMA Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.2.0 vanilla 5.2.0 intel_pstate/hwp 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 858.85 +- 1.16% ( ) 791.94 +- 0.21% ( 7.79%) 474.95 ( 44.70%) 5.2.0 3C-turbo 5.2.0 4C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 475.26 +- 0.20% ( 44.66%) 474.34 +- 0.13% ( 44.77%) In this test, which is of interest as representing shell-intensive (i.e. fork-intensive) serialized workloads, invariant schedutil outperforms intel_pstate/powersave by a whopping 40% margin. 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT --------------------------------------------- The following table shows average power consumption in watt for each benchmark. Data comes from turbostat (package average), which in turn is read from the RAPL interface on CPUs. We know the patch affects CPU frequencies so it's reasonable to ignore other power consumers (such as memory or I/O). Also, we don't have a power meter available in the lab so RAPL is the best we have. turbostat sampled average power every 10 seconds for the entire duration of each benchmark. We took all those values and averaged them (i.e. with don't have detail on a per-parameter granularity, only on whole benchmarks). 80x-BROADWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 8C pgbench-ro 130.01 142.77 131.11 132.45 | 134.65 | 136.84 pgbench-rw 68.30 60.83 71.45 71.70 | 71.65 | 72.54 dbench4 90.25 59.06 101.43 99.89 | 101.10 | 102.94 netperf-udp 65.70 69.81 66.02 68.03 | 68.27 | 68.95 netperf-tcp 88.08 87.96 88.97 88.89 | 88.85 | 88.20 tbench4 142.32 176.73 153.02 163.91 | 165.58 | 176.07 kernbench 92.94 101.95 114.91 115.47 | 115.52 | 115.10 gitsource 40.92 41.87 75.14 75.20 | 75.40 | 75.70 +--------+ 8x-SKYLAKE-UMA (power consumption, watts) +--------+ BASELINE I_PSTATE/HWP 1C 3C | 4C | pgbench-ro 46.49 46.68 46.56 46.59 | 46.52 | pgbench-rw 29.34 31.38 30.98 31.00 | 31.00 | dbench4 27.28 27.37 27.49 27.41 | 27.38 | netperf-udp 22.33 22.41 22.36 22.35 | 22.36 | netperf-tcp 27.29 27.29 27.30 27.31 | 27.33 | tbench4 41.13 45.61 43.10 43.33 | 43.56 | kernbench 42.56 42.63 43.01 43.01 | 43.01 | gitsource 13.32 13.69 17.33 17.30 | 17.35 | +--------+ 48x-HASWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 12C pgbench-ro 128.84 136.04 129.87 132.43 | 132.30 | 134.86 pgbench-rw 37.68 37.92 37.17 37.74 | 37.73 | 37.31 dbench4 28.56 28.73 28.60 28.73 | 28.70 | 28.79 netperf-udp 56.70 60.44 56.79 57.42 | 57.54 | 57.52 netperf-tcp 75.49 75.27 75.87 76.02 | 76.01 | 75.95 tbench4 115.44 139.51 119.53 123.07 | 123.97 | 130.22 kernbench 83.23 91.55 95.58 95.69 | 95.72 | 96.04 gitsource 36.79 36.99 39.99 40.34 | 40.35 | 40.23 +--------+ A lower power consumption isn't necessarily better, it depends on what is done with that energy. Here are tables with the ratio of performance-per-watt on each machine and benchmark. Higher is always better; a tilde (~) means a neutral ratio (i.e. 1.00). 80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08 pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97 dbench4 1.24 0.94 0.95 | 0.94 | 0.92 netperf-udp ~ 1.02 1.02 | ~ | 1.02 netperf-tcp ~ 1.02 ~ | ~ | 1.02 tbench4 1.26 1.10 1.06 | 1.12 | 1.26 kernbench 0.98 0.97 0.97 | 0.97 | 0.98 gitsource ~ 1.11 1.11 | 1.11 | 1.13 +------+ 8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw 0.95 0.97 0.96 | 0.96 | dbench4 ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.17 1.09 1.08 | 1.10 | kernbench ~ ~ ~ | ~ | gitsource 1.06 1.40 1.40 | 1.40 | +------+ 48x-HASWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11 pgbench-rw ~ 0.86 ~ | ~ | 0.86 dbench4 ~ 1.02 1.02 | 1.02 | ~ netperf-udp ~ 0.97 1.03 | 1.02 | ~ netperf-tcp 0.96 ~ ~ | ~ | ~ tbench4 1.24 ~ 1.06 | 1.05 | 1.11 kernbench 0.97 0.97 0.98 | 0.97 | 0.96 gitsource 1.03 1.33 1.32 | 1.32 | 1.33 +------+ These results are overall pleasing: in plenty of cases we observe performance-per-watt improvements. The few regressions (read/write pgbench and dbench on the Broadwell machine) are of small magnitude. kernbench loses a few percentage points (it has a 10-15% performance improvement, but apparently the increase in power consumption is larger than that). tbench4 and gitsource, which benefit the most from the patch, keep a positive score in this table which is a welcome surprise; that suggests that in those particular workloads the non-invariant schedutil (and active intel_pstate, too) makes some rather suboptimal frequency selections. +-------------------------------------------------------------------------+ | 6. MICROARCH'ES ADDRESSED HERE +-------------------------------------------------------------------------+ The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies respectively. This excludes the recent Xeon Scalable Performance processors line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently. Subsequent patches will address: * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus * Xeon Phi (Knights Landing, Knights Mill) * Atom Silvermont +-------------------------------------------------------------------------+ | 7. REFERENCES +-------------------------------------------------------------------------+ Tests have been run with the help of the MMTests performance testing framework, see github.com/gormanm/mmtests. The configuration file names for the benchmark used are: db-pgbench-timed-ro-small-xfs db-pgbench-timed-rw-small-xfs io-dbench4-async-xfs network-netperf-unbound network-tbench scheduler-unbound workload-kerndevel-xfs workload-shellscripts-xfs hpc-nas-c-class-mpi-full-xfs hpc-nas-c-class-omp-full All those benchmarks are generally available on the web: pgbench: https://www.postgresql.org/docs/10/pgbench.html netperf: https://hewlettpackard.github.io/netperf/ dbench/tbench: https://dbench.samba.org/ gitsource: git unit test suite, github.com/git/git NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Doug Smythies <dsmythies@telus.net> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
2020-01-22 16:16:12 +01:00
#include <asm/cpufeature.h>
DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
DECLARE_PER_CPU(unsigned long, arch_freq_scale);
static inline long arch_scale_freq_capacity(int cpu)
{
return per_cpu(arch_freq_scale, cpu);
}
#define arch_scale_freq_capacity arch_scale_freq_capacity
x86/sched: Add basic support for CPU capacity scaling In order be able to compute the sizes of tasks consistently across all CPUs in a hybrid system, it is necessary to provide CPU capacity scaling information to the scheduler via arch_scale_cpu_capacity(). Moreover, the value returned by arch_scale_freq_capacity() for the given CPU must correspond to the arch_scale_cpu_capacity() return value for it, or utilization computations will be inaccurate. Add support for it through per-CPU variables holding the capacity and maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will be returned by arch_scale_cpu_capacity() and used by scale_freq_tick() to compute arch_freq_scale for the current CPU, respectively. In order to avoid adding measurable overhead for non-hybrid x86 systems, which are the vast majority in the field, whether or not the new hybrid CPU capacity scaling will be in effect is controlled by a static key. This static key is set by calling arch_enable_hybrid_capacity_scale() which also allocates memory for the per-CPU data and initializes it. Next, arch_set_cpu_capacity() is used to set the per-CPU variables mentioned above for each CPU and arch_rebuild_sched_domains() needs to be called for the scheduler to realize that capacity-aware scheduling can be used going forward. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance Link: https://patch.msgid.link/10523497.nUPlyArG6x@rjwysocki.net [ rjw: Added parens to function kerneldoc comments ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-28 13:47:25 +02:00
bool arch_enable_hybrid_capacity_scale(void);
void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long max_cap,
unsigned long cap_freq, unsigned long base_freq);
unsigned long arch_scale_cpu_capacity(int cpu);
#define arch_scale_cpu_capacity arch_scale_cpu_capacity
extern void arch_set_max_freq_ratio(bool turbo_disabled);
extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
#else
x86/sched: Add basic support for CPU capacity scaling In order be able to compute the sizes of tasks consistently across all CPUs in a hybrid system, it is necessary to provide CPU capacity scaling information to the scheduler via arch_scale_cpu_capacity(). Moreover, the value returned by arch_scale_freq_capacity() for the given CPU must correspond to the arch_scale_cpu_capacity() return value for it, or utilization computations will be inaccurate. Add support for it through per-CPU variables holding the capacity and maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will be returned by arch_scale_cpu_capacity() and used by scale_freq_tick() to compute arch_freq_scale for the current CPU, respectively. In order to avoid adding measurable overhead for non-hybrid x86 systems, which are the vast majority in the field, whether or not the new hybrid CPU capacity scaling will be in effect is controlled by a static key. This static key is set by calling arch_enable_hybrid_capacity_scale() which also allocates memory for the per-CPU data and initializes it. Next, arch_set_cpu_capacity() is used to set the per-CPU variables mentioned above for each CPU and arch_rebuild_sched_domains() needs to be called for the scheduler to realize that capacity-aware scheduling can be used going forward. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance Link: https://patch.msgid.link/10523497.nUPlyArG6x@rjwysocki.net [ rjw: Added parens to function kerneldoc comments ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-28 13:47:25 +02:00
static inline bool arch_enable_hybrid_capacity_scale(void) { return false; }
static inline void arch_set_cpu_capacity(int cpu, unsigned long cap,
unsigned long max_cap,
unsigned long cap_freq,
unsigned long base_freq) { }
static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
x86, sched: Add support for frequency invariance Implement arch_scale_freq_capacity() for 'modern' x86. This function is used by the scheduler to correctly account usage in the face of DVFS. The present patch addresses Intel processors specifically and has positive performance and performance-per-watt implications for the schedutil cpufreq governor, bringing it closer to, if not on-par with, the powersave governor from the intel_pstate driver/framework. Large performance gains are obtained when the machine is lightly loaded and no regression are observed at saturation. The benchmarks with the largest gains are kernel compilation, tbench (the networking version of dbench) and shell-intensive workloads. 1. FREQUENCY INVARIANCE: MOTIVATION * Without it, a task looks larger if the CPU runs slower 2. PECULIARITIES OF X86 * freq invariance accounting requires knowing the ratio freq_curr/freq_max 2.1 CURRENT FREQUENCY * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz") 2.2 MAX FREQUENCY * It varies with time (turbo). As an approximation, we set it to a constant, i.e. 4-cores turbo frequency. 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR * The invariant schedutil's formula has no feedback loop and reacts faster to utilization changes 4. KNOWN LIMITATIONS * In some cases tasks can't reach max util despite how hard they try 5. PERFORMANCE TESTING 5.1 MACHINES * Skylake, Broadwell, Haswell 5.2 SETUP * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12 active cores turbo w/ invariant schedutil, and intel_pstate/powersave 5.3 BENCHMARK RESULTS 5.3.1 NEUTRAL BENCHMARKS * NAS Parallel Benchmark (HPC), hackbench 5.3.2 NON-NEUTRAL BENCHMARKS * tbench (10-30% better), kernbench (10-15% better), shell-intensive-scripts (30-50% better) * no regressions 5.3.3 SELECTION OF DETAILED RESULTS 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT * dbench (5% worse on one machine), kernbench (3% worse), tbench (5-10% better), shell-intensive-scripts (10-40% better) 6. MICROARCH'ES ADDRESSED HERE * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum etc have different MSRs semantic for querying turbo levels) 7. REFERENCES * MMTests performance testing framework, github.com/gormanm/mmtests +-------------------------------------------------------------------------+ | 1. FREQUENCY INVARIANCE: MOTIVATION +-------------------------------------------------------------------------+ For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When running a task that would consume 1/3rd of a CPU at 1000 MHz, it would appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the false impression this CPU is almost at capacity, even though it can go faster [*]. In a nutshell, without frequency scale-invariance tasks look larger just because the CPU is running slower. [*] (footnote: this assumes a linear frequency/performance relation; which everybody knows to be false, but given realities its the best approximation we can make.) +-------------------------------------------------------------------------+ | 2. PECULIARITIES OF X86 +-------------------------------------------------------------------------+ Accounting for frequency changes in PELT signals requires the computation of the ratio freq_curr / freq_max. On x86 neither of those terms is readily available. 2.1 CURRENT FREQUENCY ==================== Since modern x86 has hardware control over the actual frequency we run at (because amongst other things, Turbo-Mode), we cannot simply use the frequency as requested through cpufreq. Instead we use the APERF/MPERF MSRs to compute the effective frequency over the recent past. Also, because reading MSRs is expensive, don't do so every time we need the value, but amortize the cost by doing it every tick. 2.2 MAX FREQUENCY ================= Obtaining freq_max is also non-trivial because at any time the hardware can provide a frequency boost to a selected subset of cores if the package has enough power to spare (eg: Turbo Boost). This means that the maximum frequency available to a given core changes with time. The approach taken in this change is to arbitrarily set freq_max to a constant value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most microarchitectures, after evaluating the following candidates: * 1-core (1C) turbo frequency (the fastest turbo state available) * around base frequency (a.k.a. max P-state) * something in between, such as 4C turbo To interpret these options, consider that this is the denominator in freq_curr/freq_max, and that ratio will be used to scale PELT signals such as util_avg and load_avg. A large denominator will undershoot (util_avg looks a bit smaller than it really is), viceversa with a smaller denominator PELT signals will tend to overshoot. Given that PELT drives frequency selection in the schedutil governor, we will have: freq_max set to | effect on DVFS --------------------+------------------ 1C turbo | power efficiency (lower freq choices) base freq | performance (higher util_avg, higher freq requests) 4C turbo | a bit of both 4C turbo proves to be a good compromise in a number of benchmarks (see below). +-------------------------------------------------------------------------+ | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR +-------------------------------------------------------------------------+ Once an architecture implements a frequency scale-invariant utilization (the PELT signal util_avg), schedutil switches its frequency selection formula from freq_next = 1.25 * freq_curr * util [non-invariant util signal] to freq_next = 1.25 * freq_max * util [invariant util signal] where, in the second formula, freq_max is set to the 1C turbo frequency (max turbo). The advantage of the second formula, whose usage we unlock with this patch, is that freq_next doesn't depend on the current frequency in an iterative fashion, but can jump to any frequency in a single update. This absence of feedback in the formula makes it quicker to react to utilization changes and more robust against pathological instabilities. Compare it to the update formula of intel_pstate/powersave: freq_next = 1.25 * freq_max * Busy% where again freq_max is 1C turbo and Busy% is the percentage of time not spent idling (calculated with delta_MPERF / delta_TSC); essentially the same as invariant schedutil, and largely responsible for intel_pstate/powersave good reputation. The non-invariant schedutil formula is derived from the invariant one by approximating util_inv with util_raw * freq_curr / freq_max, but this has limitations. Testing shows improved performances due to better frequency selections when the machine is lightly loaded, and essentially no change in behaviour at saturation / overutilization. +-------------------------------------------------------------------------+ | 4. KNOWN LIMITATIONS +-------------------------------------------------------------------------+ It's been shown that it is possible to create pathological scenarios where a CPU-bound task cannot reach max utilization, if the normalizing factor freq_max is fixed to a constant value (see [Lelli-2018]). If freq_max is set to 4C turbo as we do here, one needs to peg at least 5 cores in a package doing some busywork, and observe that none of those task will ever reach max util (1024) because they're all running at less than the 4C turbo frequency. While this concern still applies, we believe the performance benefit of frequency scale-invariant PELT signals outweights the cost of this limitation. [Lelli-2018] https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/ +-------------------------------------------------------------------------+ | 5. PERFORMANCE TESTING +-------------------------------------------------------------------------+ 5.1 MACHINES ============ We tested the patch on three machines, with Skylake, Broadwell and Haswell CPUs. The details are below, together with the available turbo ratios as reported by the appropriate MSRs. * 8x-SKYLAKE-UMA: Single socket E3-1240 v5, Skylake 4 cores/8 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 800 |******** BASE 3500 |*********************************** 4C 3700 |************************************* 3C 3800 |************************************** 2C 3900 |*************************************** 1C 3900 |*************************************** * 80x-BROADWELL-NUMA: Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2200 |********************** 8C 2900 |***************************** 7C 3000 |****************************** 6C 3100 |******************************* 5C 3200 |******************************** 4C 3300 |********************************* 3C 3400 |********************************** 2C 3600 |************************************ 1C 3600 |************************************ * 48x-HASWELL-NUMA Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2300 |*********************** 12C 2600 |************************** 11C 2600 |************************** 10C 2600 |************************** 9C 2600 |************************** 8C 2600 |************************** 7C 2600 |************************** 6C 2600 |************************** 5C 2700 |*************************** 4C 2800 |**************************** 3C 2900 |***************************** 2C 3100 |******************************* 1C 3100 |******************************* 5.2 SETUP ========= * The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate driver in passive mode. * The rationale for choosing the various freq_max values to test have been to try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical on all machines), plus one more value closer to base_freq but still in the turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA). * In addition we've run all tests with intel_pstate/powersave for comparison. * The filesystem is always XFS, the userspace is openSUSE Leap 15.1. * 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs with active intel_pstate on this machine use that. This gives, in terms of combinations tested on each machine: * 8x-SKYLAKE-UMA * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive * intel_pstate active + powersave + HWP * invariant schedutil, freq_max = 1C turbo * invariant schedutil, freq_max = 3C turbo * invariant schedutil, freq_max = 4C turbo * both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA * [same as 8x-SKYLAKE-UMA, but no HWP capable] * invariant schedutil, freq_max = 8C turbo (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo") 5.3 BENCHMARK RESULTS ===================== 5.3.1 NEUTRAL BENCHMARKS ------------------------ Tests that didn't show any measurable difference in performance on any of the test machines between non-invariant schedutil and our patch are: * NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any computational kernel * flexible I/O (FIO) * hackbench (using threads or processes, and using pipes or sockets) 5.3.2 NON-NEUTRAL BENCHMARKS ---------------------------- What follow are summary tables where each benchmark result is given a score. * A tilde (~) means a neutral result, i.e. no difference from baseline. * Scores are computed with the ratio result_new / result_baseline, so a tilde means a score of 1.00. * The results in the score ratio are the geometric means of results running the benchmark with different parameters (eg: for kernbench: using 1, 2, 4, ... number of processes; for pgbench: varying the number of clients, and so on). * The first three tables show higher-is-better kind of tests (i.e. measured in operations/second), the subsequent three show lower-is-better kind of tests (i.e. the workload is fixed and we measure elapsed time, think kernbench). * "gitsource" is a name we made up for the test consisting in running the entire unit tests suite of the Git SCM and measuring how long it takes. We take it as a typical example of shell-intensive serialized workload. * In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other columns show invariant schedutil for different values of freq_max. 4C turbo is circled as it's the value we've chosen for the final implementation. 80x-BROADWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.14 ~ ~ | 1.11 | 1.14 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.06 ~ 1.06 | 1.05 | 1.07 netperf-tcp ~ 1.03 ~ | 1.01 | 1.02 tbench4 1.57 1.18 1.22 | 1.30 | 1.56 +------+ 8x-SKYLAKE-UMA (comparison ratio; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.30 1.14 1.14 | 1.16 | +------+ 48x-HASWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.15 ~ ~ | 1.06 | 1.16 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.05 0.97 1.04 | 1.04 | 1.02 netperf-tcp 0.96 1.01 1.01 | 1.01 | 1.01 tbench4 1.50 1.05 1.13 | 1.13 | 1.25 +------+ In the table above we see that active intel_pstate is slightly better than our 4C-turbo patch (both in reference to the baseline non-invariant schedutil) on read-only pgbench and much better on tbench. Both cases are notable in which it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant schedutil to get closer. If we ignore active intel_pstate and focus on the comparison with baseline alone, there are several instances of double-digit performance improvement. 80x-BROADWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 1.23 0.95 0.95 | 0.95 | 0.95 kernbench 0.93 0.83 0.83 | 0.83 | 0.82 gitsource 0.98 0.49 0.49 | 0.49 | 0.48 +------+ 8x-SKYLAKE-UMA (comparison ratio; lower is better) +------+ I_PSTATE/HWP 1C 3C | 4C | dbench4 ~ ~ ~ | ~ | kernbench ~ ~ ~ | ~ | gitsource 0.92 0.55 0.55 | 0.55 | +------+ 48x-HASWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 ~ ~ ~ | ~ | ~ kernbench 0.94 0.90 0.89 | 0.90 | 0.90 gitsource 0.97 0.69 0.69 | 0.69 | 0.69 +------+ dbench is not very remarkable here, unless we notice how poorly active intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus non-invariant schedutil. We repeated that run getting consistent results. Out of scope for the patch at hand, but deserving future investigation. Other than that, we previously ran this campaign with Linux v5.0 and saw the patch doing better on dbench a the time. We haven't checked closely and can only speculate at this point. On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in the detailed tables that the gains concentrate on low process counts (lightly loaded machines). The test we call "gitsource" (running the git unit test suite, a long-running single-threaded shell script) appears rather spectacular in this table (gains of 30-50% depending on the machine). It is to be noted, however, that gitsource has no adjustable parameters (such as the number of jobs in kernbench, which we average over in order to get a single-number summary score) and is exactly the kind of low-parallelism workload that benefits the most from this patch. When looking at the detailed tables of kernbench or tbench4, at low process or client counts one can see similar numbers. 5.3.3 SELECTION OF DETAILED RESULTS ----------------------------------- Machine : 48x-HASWELL-NUMA Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback) Varying parameter : number of clients Unit : MB/sec (higher is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 126.73 +- 0.31% ( ) 315.91 +- 0.66% ( 149.28%) 125.03 +- 0.76% ( -1.34%) Hmean 2 258.04 +- 0.62% ( ) 614.16 +- 0.51% ( 138.01%) 269.58 +- 1.45% ( 4.47%) Hmean 4 514.30 +- 0.67% ( ) 1146.58 +- 0.54% ( 122.94%) 533.84 +- 1.99% ( 3.80%) Hmean 8 1111.38 +- 2.52% ( ) 2159.78 +- 0.38% ( 94.33%) 1359.92 +- 1.56% ( 22.36%) Hmean 16 2286.47 +- 1.36% ( ) 3338.29 +- 0.21% ( 46.00%) 2720.20 +- 0.52% ( 18.97%) Hmean 32 4704.84 +- 0.35% ( ) 4759.03 +- 0.43% ( 1.15%) 4774.48 +- 0.30% ( 1.48%) Hmean 64 7578.04 +- 0.27% ( ) 7533.70 +- 0.43% ( -0.59%) 7462.17 +- 0.65% ( -1.53%) Hmean 128 6998.52 +- 0.16% ( ) 6987.59 +- 0.12% ( -0.16%) 6909.17 +- 0.14% ( -1.28%) Hmean 192 6901.35 +- 0.25% ( ) 6913.16 +- 0.10% ( 0.17%) 6855.47 +- 0.21% ( -0.66%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 12C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 128.43 +- 0.28% ( 1.34%) 130.64 +- 3.81% ( 3.09%) 153.71 +- 5.89% ( 21.30%) Hmean 2 311.70 +- 6.15% ( 20.79%) 281.66 +- 3.40% ( 9.15%) 305.08 +- 5.70% ( 18.23%) Hmean 4 641.98 +- 2.32% ( 24.83%) 623.88 +- 5.28% ( 21.31%) 906.84 +- 4.65% ( 76.32%) Hmean 8 1633.31 +- 1.56% ( 46.96%) 1714.16 +- 0.93% ( 54.24%) 2095.74 +- 0.47% ( 88.57%) Hmean 16 3047.24 +- 0.42% ( 33.27%) 3155.02 +- 0.30% ( 37.99%) 3634.58 +- 0.15% ( 58.96%) Hmean 32 4734.31 +- 0.60% ( 0.63%) 4804.38 +- 0.23% ( 2.12%) 4674.62 +- 0.27% ( -0.64%) Hmean 64 7699.74 +- 0.35% ( 1.61%) 7499.72 +- 0.34% ( -1.03%) 7659.03 +- 0.25% ( 1.07%) Hmean 128 6935.18 +- 0.15% ( -0.91%) 6942.54 +- 0.10% ( -0.80%) 7004.85 +- 0.12% ( 0.09%) Hmean 192 6901.62 +- 0.12% ( 0.00%) 6856.93 +- 0.10% ( -0.64%) 6978.74 +- 0.10% ( 1.12%) This is one of the cases where the patch still can't surpass active intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are visible up to 16 clients and the saturated scenario is the same as baseline. The scores in the summary table from the previous sections are ratios of geometric means of the results over different clients, as seen in this table. Machine : 80x-BROADWELL-NUMA Benchmark : kernbench (kernel compilation) Varying parameter : number of jobs Unit : seconds (lower is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 379.68 +- 0.06% ( ) 330.20 +- 0.43% ( 13.03%) 285.93 +- 0.07% ( 24.69%) Amean 4 200.15 +- 0.24% ( ) 175.89 +- 0.22% ( 12.12%) 153.78 +- 0.25% ( 23.17%) Amean 8 106.20 +- 0.31% ( ) 95.54 +- 0.23% ( 10.03%) 86.74 +- 0.10% ( 18.32%) Amean 16 56.96 +- 1.31% ( ) 53.25 +- 1.22% ( 6.50%) 48.34 +- 1.73% ( 15.13%) Amean 32 34.80 +- 2.46% ( ) 33.81 +- 0.77% ( 2.83%) 30.28 +- 1.59% ( 12.99%) Amean 64 26.11 +- 1.63% ( ) 25.04 +- 1.07% ( 4.10%) 22.41 +- 2.37% ( 14.16%) Amean 128 24.80 +- 1.36% ( ) 23.57 +- 1.23% ( 4.93%) 21.44 +- 1.37% ( 13.55%) Amean 160 24.85 +- 0.56% ( ) 23.85 +- 1.17% ( 4.06%) 21.25 +- 1.12% ( 14.49%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 8C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 284.08 +- 0.13% ( 25.18%) 283.96 +- 0.51% ( 25.21%) 285.05 +- 0.21% ( 24.92%) Amean 4 153.18 +- 0.22% ( 23.47%) 154.70 +- 1.64% ( 22.71%) 153.64 +- 0.30% ( 23.24%) Amean 8 87.06 +- 0.28% ( 18.02%) 86.77 +- 0.46% ( 18.29%) 86.78 +- 0.22% ( 18.28%) Amean 16 48.03 +- 0.93% ( 15.68%) 47.75 +- 1.99% ( 16.17%) 47.52 +- 1.61% ( 16.57%) Amean 32 30.23 +- 1.20% ( 13.14%) 30.08 +- 1.67% ( 13.57%) 30.07 +- 1.67% ( 13.60%) Amean 64 22.59 +- 2.02% ( 13.50%) 22.63 +- 0.81% ( 13.32%) 22.42 +- 0.76% ( 14.12%) Amean 128 21.37 +- 0.67% ( 13.82%) 21.31 +- 1.15% ( 14.07%) 21.17 +- 1.93% ( 14.63%) Amean 160 21.68 +- 0.57% ( 12.76%) 21.18 +- 1.74% ( 14.77%) 21.22 +- 1.00% ( 14.61%) The patch outperform active intel_pstate (and baseline) by a considerable margin; the summary table from the previous section says 4C turbo and active intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is 0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no noticeable difference with regard to the value of freq_max. Machine : 8x-SKYLAKE-UMA Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.2.0 vanilla 5.2.0 intel_pstate/hwp 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 858.85 +- 1.16% ( ) 791.94 +- 0.21% ( 7.79%) 474.95 ( 44.70%) 5.2.0 3C-turbo 5.2.0 4C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 475.26 +- 0.20% ( 44.66%) 474.34 +- 0.13% ( 44.77%) In this test, which is of interest as representing shell-intensive (i.e. fork-intensive) serialized workloads, invariant schedutil outperforms intel_pstate/powersave by a whopping 40% margin. 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT --------------------------------------------- The following table shows average power consumption in watt for each benchmark. Data comes from turbostat (package average), which in turn is read from the RAPL interface on CPUs. We know the patch affects CPU frequencies so it's reasonable to ignore other power consumers (such as memory or I/O). Also, we don't have a power meter available in the lab so RAPL is the best we have. turbostat sampled average power every 10 seconds for the entire duration of each benchmark. We took all those values and averaged them (i.e. with don't have detail on a per-parameter granularity, only on whole benchmarks). 80x-BROADWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 8C pgbench-ro 130.01 142.77 131.11 132.45 | 134.65 | 136.84 pgbench-rw 68.30 60.83 71.45 71.70 | 71.65 | 72.54 dbench4 90.25 59.06 101.43 99.89 | 101.10 | 102.94 netperf-udp 65.70 69.81 66.02 68.03 | 68.27 | 68.95 netperf-tcp 88.08 87.96 88.97 88.89 | 88.85 | 88.20 tbench4 142.32 176.73 153.02 163.91 | 165.58 | 176.07 kernbench 92.94 101.95 114.91 115.47 | 115.52 | 115.10 gitsource 40.92 41.87 75.14 75.20 | 75.40 | 75.70 +--------+ 8x-SKYLAKE-UMA (power consumption, watts) +--------+ BASELINE I_PSTATE/HWP 1C 3C | 4C | pgbench-ro 46.49 46.68 46.56 46.59 | 46.52 | pgbench-rw 29.34 31.38 30.98 31.00 | 31.00 | dbench4 27.28 27.37 27.49 27.41 | 27.38 | netperf-udp 22.33 22.41 22.36 22.35 | 22.36 | netperf-tcp 27.29 27.29 27.30 27.31 | 27.33 | tbench4 41.13 45.61 43.10 43.33 | 43.56 | kernbench 42.56 42.63 43.01 43.01 | 43.01 | gitsource 13.32 13.69 17.33 17.30 | 17.35 | +--------+ 48x-HASWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 12C pgbench-ro 128.84 136.04 129.87 132.43 | 132.30 | 134.86 pgbench-rw 37.68 37.92 37.17 37.74 | 37.73 | 37.31 dbench4 28.56 28.73 28.60 28.73 | 28.70 | 28.79 netperf-udp 56.70 60.44 56.79 57.42 | 57.54 | 57.52 netperf-tcp 75.49 75.27 75.87 76.02 | 76.01 | 75.95 tbench4 115.44 139.51 119.53 123.07 | 123.97 | 130.22 kernbench 83.23 91.55 95.58 95.69 | 95.72 | 96.04 gitsource 36.79 36.99 39.99 40.34 | 40.35 | 40.23 +--------+ A lower power consumption isn't necessarily better, it depends on what is done with that energy. Here are tables with the ratio of performance-per-watt on each machine and benchmark. Higher is always better; a tilde (~) means a neutral ratio (i.e. 1.00). 80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08 pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97 dbench4 1.24 0.94 0.95 | 0.94 | 0.92 netperf-udp ~ 1.02 1.02 | ~ | 1.02 netperf-tcp ~ 1.02 ~ | ~ | 1.02 tbench4 1.26 1.10 1.06 | 1.12 | 1.26 kernbench 0.98 0.97 0.97 | 0.97 | 0.98 gitsource ~ 1.11 1.11 | 1.11 | 1.13 +------+ 8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw 0.95 0.97 0.96 | 0.96 | dbench4 ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.17 1.09 1.08 | 1.10 | kernbench ~ ~ ~ | ~ | gitsource 1.06 1.40 1.40 | 1.40 | +------+ 48x-HASWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11 pgbench-rw ~ 0.86 ~ | ~ | 0.86 dbench4 ~ 1.02 1.02 | 1.02 | ~ netperf-udp ~ 0.97 1.03 | 1.02 | ~ netperf-tcp 0.96 ~ ~ | ~ | ~ tbench4 1.24 ~ 1.06 | 1.05 | 1.11 kernbench 0.97 0.97 0.98 | 0.97 | 0.96 gitsource 1.03 1.33 1.32 | 1.32 | 1.33 +------+ These results are overall pleasing: in plenty of cases we observe performance-per-watt improvements. The few regressions (read/write pgbench and dbench on the Broadwell machine) are of small magnitude. kernbench loses a few percentage points (it has a 10-15% performance improvement, but apparently the increase in power consumption is larger than that). tbench4 and gitsource, which benefit the most from the patch, keep a positive score in this table which is a welcome surprise; that suggests that in those particular workloads the non-invariant schedutil (and active intel_pstate, too) makes some rather suboptimal frequency selections. +-------------------------------------------------------------------------+ | 6. MICROARCH'ES ADDRESSED HERE +-------------------------------------------------------------------------+ The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies respectively. This excludes the recent Xeon Scalable Performance processors line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently. Subsequent patches will address: * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus * Xeon Phi (Knights Landing, Knights Mill) * Atom Silvermont +-------------------------------------------------------------------------+ | 7. REFERENCES +-------------------------------------------------------------------------+ Tests have been run with the help of the MMTests performance testing framework, see github.com/gormanm/mmtests. The configuration file names for the benchmark used are: db-pgbench-timed-ro-small-xfs db-pgbench-timed-rw-small-xfs io-dbench4-async-xfs network-netperf-unbound network-tbench scheduler-unbound workload-kerndevel-xfs workload-shellscripts-xfs hpc-nas-c-class-mpi-full-xfs hpc-nas-c-class-omp-full All those benchmarks are generally available on the web: pgbench: https://www.postgresql.org/docs/10/pgbench.html netperf: https://hewlettpackard.github.io/netperf/ dbench/tbench: https://dbench.samba.org/ gitsource: git unit test suite, github.com/git/git NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Doug Smythies <dsmythies@telus.net> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
2020-01-22 16:16:12 +01:00
#endif
extern void arch_scale_freq_tick(void);
#define arch_scale_freq_tick arch_scale_freq_tick
#endif /* _ASM_X86_TOPOLOGY_H */