powerpc/64s: Implement queued spinlocks and rwlocks

These have shown significantly improved performance and fairness when
spinlock contention is moderate to high on very large systems.

With this series, including subsequent patches, on a 16-socket, 1536-thread
POWER9, a stress test such as same-file open/close from all CPUs gets big
speedups: 11620 op/s aggregate with simple spinlocks vs 384158 op/s with
queued spinlocks (33x faster), and the difference in throughput between the
fastest and slowest thread goes from 7x to 1.4x.

Thanks to the fast path being identical in terms of atomics and barriers
(after a subsequent optimisation patch), single-threaded performance is not
changed (no measurable difference).

On smaller systems, performance and fairness seem to be generally improved.
Using dbench on tmpfs as a test (which starts to run into kernel spinlock
contention), a 2-socket OpenPOWER POWER9 system was tested with bare metal
and KVM guest configurations. Results can be found here:

https://github.com/linuxppc/issues/issues/305#issuecomment-663487453

Observations are:

- Queued spinlocks are equal when contention is insignificant, as expected
  and as measured with microbenchmarks.
- When there is contention, on bare metal queued spinlocks have better
  throughput and max latency at all points.
- When virtualised, queued spinlocks are slightly worse approaching peak
  throughput, but have significantly better throughput and max latency at
  all points beyond peak, until queued spinlock maximum latency rises when
  clients are 2x vCPUs.

The regressions haven't been analysed very well yet, and there are a lot of
things that can be tuned, particularly the paravirtualised locking, but the
numbers already look like a good net win even on relatively small systems.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-4-npiggin@gmail.com

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_POWERPC_QSPINLOCK_H
#define _ASM_POWERPC_QSPINLOCK_H

#include <linux/compiler.h>
#include <asm/qspinlock_types.h>
#include <asm/paravirt.h>

#ifdef CONFIG_PPC64
/*
 * Use the EH=1 hint for accesses that result in the lock being acquired.
 * The hardware is supposed to optimise this pattern by holding the lock
 * cacheline longer, and releasing when a store to the same memory (the
 * unlock) is performed.
 */
#define _Q_SPIN_EH_HINT 1
#else
#define _Q_SPIN_EH_HINT 0
#endif

/*
 * The trylock itself may steal. This makes trylocks slightly stronger, and
 * makes locks slightly more efficient when stealing.
 *
 * This is compile-time, so if true then there may always be stealers, so the
 * nosteal paths become unused.
 */
#define _Q_SPIN_TRY_LOCK_STEAL 1

/*
 * Put a speculation barrier after testing the lock/node and finding it
 * busy. Try to prevent pointless speculation in slow paths.
 *
 * Slows down the lockstorm microbenchmark with no stealing, where locking
 * is purely FIFO through the queue. May have more benefit in a real workload
 * where speculating into the wrong place could have a greater cost.
 */
#define _Q_SPIN_SPEC_BARRIER 0

#ifdef CONFIG_PPC64
/*
 * Execute a miso instruction after passing the MCS lock ownership to the
 * queue head. Miso is intended to make stores visible to other CPUs sooner.
 *
 * This seems to make the lockstorm microbenchmark nospin test go slightly
 * faster on POWER10, but it is disabled for now.
 */
#define _Q_SPIN_MISO 0
#else
#define _Q_SPIN_MISO 0
#endif

#ifdef CONFIG_PPC64
/*
 * This executes miso after an unlock of the lock word, so that ownership
 * passes to the next CPU sooner. This will slow the uncontended path to some
 * degree. No evidence it helps yet.
 */
#define _Q_SPIN_MISO_UNLOCK 0
#else
#define _Q_SPIN_MISO_UNLOCK 0
#endif

/*
 * Seems to slow down the lockstorm microbenchmark; suspect the queue node
 * just has to become shared again right afterwards when its waiter spins on
 * the lock field.
 */
#define _Q_SPIN_PREFETCH_NEXT 0

static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
{
	return READ_ONCE(lock->val);
}

static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
{
	return !lock.val;
}
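
/* The tail field is non-zero when one or more CPUs are queued waiting for the lock. */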
static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
{
	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
}
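
/*
 * The locked value encodes the locked bit together with the owner CPU id;
 * recording the owner lets the paravirt slow path yield to the lock holder.
 */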
static __always_inline u32 queued_spin_encode_locked_val(void)
{
	/* XXX: make this use lock value in paca like simple spinlocks? */
	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
}
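
/*
 * larx/stcx. sequence: reserve the lock word, give up if it is non-zero
 * (locked, or queued waiters present), otherwise try to store the locked
 * value. The acquire barrier orders the critical section after a
 * successful store-conditional.
 */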
static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock *lock)
{
	u32 new = queued_spin_encode_locked_val();
	u32 prev;

	/* Trylock succeeds only when unlocked and no queued nodes */
	asm volatile(
"1:	lwarx	%0,0,%1,%3	# __queued_spin_trylock_nosteal	\n"
"	cmpwi	0,%0,0						\n"
"	bne-	2f						\n"
"	stwcx.	%2,0,%1						\n"
"	bne-	1b						\n"
"\t"	PPC_ACQUIRE_BARRIER "					\n"
"2:								\n"
	: "=&r" (prev)
	: "r" (&lock->val), "r" (new),
	  "i" (_Q_SPIN_EH_HINT)
	: "cr0", "memory");

	return likely(prev == 0);
}
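
/*
 * Same pattern, but only the locked/owner bits need to be clear: the tail
 * (queued waiters) is preserved, so the trylock can take the lock ahead of
 * CPUs already queued, i.e. it "steals".
 */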
static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
{
	u32 new = queued_spin_encode_locked_val();
	u32 prev, tmp;

	/* Trylock may get ahead of queued nodes if it finds unlocked */
	asm volatile(
"1:	lwarx	%0,0,%2,%5	# __queued_spin_trylock_steal	\n"
"	andc.	%1,%0,%4					\n"
"	bne-	2f						\n"
"	and	%1,%0,%4					\n"
"	or	%1,%1,%3					\n"
"	stwcx.	%1,0,%2						\n"
"	bne-	1b						\n"
"\t"	PPC_ACQUIRE_BARRIER "					\n"
"2:								\n"
	: "=&r" (prev), "=&r" (tmp)
	: "r" (&lock->val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
	  "i" (_Q_SPIN_EH_HINT)
	: "cr0", "memory");

	return likely(!(prev & ~_Q_TAIL_CPU_MASK));
}
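
/*
 * _Q_SPIN_TRY_LOCK_STEAL is a compile-time constant, so the compiler
 * eliminates whichever variant above is not selected.
 */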
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	if (!_Q_SPIN_TRY_LOCK_STEAL)
		return __queued_spin_trylock_nosteal(lock);
	else
		return __queued_spin_trylock_steal(lock);
}
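
/*
 * Lock fast path: a single inline trylock attempt; contended cases fall
 * back to the out-of-line queueing slow path.
 */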
void queued_spin_lock_slowpath(struct qspinlock *lock);

static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	if (!queued_spin_trylock(lock))
		queued_spin_lock_slowpath(lock);
}
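
/*
 * Unlock is a release store of zero to the locked field; the tail bits that
 * record queued waiters are not touched here, the slow path manages them.
 */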
static inline void queued_spin_unlock(struct qspinlock *lock)
{
	smp_store_release(&lock->locked, 0);
	if (_Q_SPIN_MISO_UNLOCK)
		asm volatile("miso" ::: "memory");
}

#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_trylock(l)		queued_spin_trylock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)

#ifdef CONFIG_PARAVIRT_SPINLOCKS
void pv_spinlocks_init(void);
#else
static inline void pv_spinlocks_init(void) { }
#endif

#endif /* _ASM_POWERPC_QSPINLOCK_H */