On atomic types (atomic_t, atomic64_t and atomic_long_t).

The atomic type provides an interface to the architecture's means of atomic
RMW operations between CPUs (atomic operations on MMIO are not supported and
can lead to fatal traps on some platforms).

API
---

The 'full' API consists of (atomic64_ and atomic_long_ prefixes omitted for
brevity):

Non-RMW ops:

  atomic_read(), atomic_set()
  atomic_read_acquire(), atomic_set_release()


RMW atomic operations:

Arithmetic:

  atomic_{add,sub,inc,dec}()
  atomic_{add,sub,inc,dec}_return{,_relaxed,_acquire,_release}()
  atomic_fetch_{add,sub,inc,dec}{,_relaxed,_acquire,_release}()


Bitwise:

  atomic_{and,or,xor,andnot}()
  atomic_fetch_{and,or,xor,andnot}{,_relaxed,_acquire,_release}()


Swap:

  atomic_xchg{,_relaxed,_acquire,_release}()
  atomic_cmpxchg{,_relaxed,_acquire,_release}()
  atomic_try_cmpxchg{,_relaxed,_acquire,_release}()


Reference count (but please see refcount_t):

  atomic_add_unless(), atomic_inc_not_zero()
  atomic_sub_and_test(), atomic_dec_and_test()


Misc:

  atomic_inc_and_test(), atomic_add_negative()
  atomic_dec_unless_positive(), atomic_inc_unless_negative()


Barriers:

  smp_mb__{before,after}_atomic()

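For a quick illustration of the reference count style ops, a minimal sketch
(the 'struct obj' type, obj_tryget(), obj_put() and obj_free() are
hypothetical, purely illustrative; again, new code should use refcount_t,
which adds saturation semantics):

  struct obj {
    atomic_t ref;
    /* ... */
  };

  /* Take a reference, but only if the object is still live; fails
   * (returns false) once the count has dropped to zero. */
  static bool obj_tryget(struct obj *o)
  {
    return atomic_inc_not_zero(&o->ref);
  }

  /* Drop a reference; atomic_dec_and_test() returns true for the
   * 1 -> 0 transition, at which point nobody holds a reference. */
  static void obj_put(struct obj *o)
  {
    if (atomic_dec_and_test(&o->ref))
      obj_free(o);
  }
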
TYPES (signed vs unsigned)
-----

While atomic_t, atomic_long_t and atomic64_t use int, long and s64
respectively (for hysterical raisins), the kernel uses -fno-strict-overflow
(which implies -fwrapv) and defines signed overflow to behave like
2s-complement.

Therefore, an explicitly unsigned variant of the atomic ops is strictly
unnecessary and we can simply cast; there is no UB.

There was a bug in UBSAN prior to GCC-8 that would generate UB warnings for
signed types.

With this we also conform to the C/C++ _Atomic behaviour and things like
P1236R1.

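As a concrete illustration of the cast trick, a wrapping, unsigned
statistics counter built on atomic_t (a minimal sketch; the helper names
are made up):

  static inline unsigned int stat_add(unsigned int n, atomic_t *v)
  {
    /* Signed overflow is defined (2s-complement) under
     * -fno-strict-overflow, so the casts are free and safe. */
    return (unsigned int)atomic_add_return((int)n, v);
  }

  static inline unsigned int stat_read(const atomic_t *v)
  {
    return (unsigned int)atomic_read(v);
  }
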
SEMANTICS
---------

Non-RMW ops:

The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
smp_store_release() respectively. Therefore, if you find yourself only using
the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all
and are doing it wrong.

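For example, the classic flag-and-payload handoff maps directly onto the
acquire/release variants (a minimal sketch; 'data', 'ready' and
do_something() are placeholders):

  int data;
  atomic_t ready = ATOMIC_INIT(0);

  /* Producer: publish the payload, then set the flag with RELEASE. */
  data = 42;
  atomic_set_release(&ready, 1);

  /* Consumer: observe the flag with ACQUIRE; the payload read below
   * then cannot be satisfied from before the flag load. */
  if (atomic_read_acquire(&ready))
    do_something(data);
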
A note for the implementation of atomic_set{}() is that it must not break the
atomicity of the RMW ops. That is:

  C Atomic-RMW-ops-are-atomic-WRT-atomic_set

  {
    atomic_t v = ATOMIC_INIT(1);
  }

  P0(atomic_t *v)
  {
    (void)atomic_add_unless(v, 1, 0);
  }

  P1(atomic_t *v)
  {
    atomic_set(v, 0);
  }

  exists
  (v=2)

In this case we would expect the atomic_set() from CPU1 to either happen
before the atomic_add_unless(), in which case that latter one would no-op, or
_after_ in which case we'd overwrite its result. In no case is "2" a valid
outcome.

This is typically true on 'normal' platforms, where a regular competing STORE
will invalidate a LL/SC or fail a CMPXCHG.

The obvious case where this is not so is when we need to implement atomic ops
with a lock:

  CPU0						CPU1

  atomic_add_unless(v, 1, 0);
    lock();
    ret = READ_ONCE(v->counter); // == 1
						atomic_set(v, 0);
    if (ret != u)				  WRITE_ONCE(v->counter, 0);
      WRITE_ONCE(v->counter, ret + 1);
    unlock();

The typical solution is then to implement atomic_set{}() with atomic_xchg().

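A minimal sketch of that solution (illustrative only; real architectures do
this in their arch-specific atomic implementation):

  /* Lock-based architecture: make atomic_set() an RMW op itself so it
   * takes the same lock and cannot land in the middle of a concurrent
   * locked RMW sequence. The old value is simply discarded. */
  static inline void atomic_set(atomic_t *v, int i)
  {
    (void)atomic_xchg(v, i);
  }

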
RMW ops:

These come in various forms:

 - plain operations without return value: atomic_{}()

 - operations which return the modified value: atomic_{}_return()

   these are limited to the arithmetic operations because those are
   reversible. Bitops are irreversible and therefore the modified value
   is of dubious utility.

 - operations which return the original value: atomic_fetch_{}()

 - swap operations: xchg(), cmpxchg() and try_cmpxchg()

 - misc; the special purpose operations that are commonly used and would,
   given the interface, normally be implemented using (try_)cmpxchg loops but
   are time critical and can, (typically) on LL/SC architectures, be more
   efficiently implemented.

All these operations are SMP atomic; that is, the operations (for a single
atomic variable) can be fully ordered and no intermediate state is lost or
visible.

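A small sketch of how the value-returning forms relate, using addition (the
values are illustrative):

  atomic_t v = ATOMIC_INIT(3);
  int new, old;

  atomic_add(2, &v);			/* no return value; v == 5 */
  new = atomic_add_return(2, &v);	/* modified value: new == 7 */
  old = atomic_fetch_add(2, &v);	/* original value: old == 7, v == 9 */
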
ORDERING  (go read memory-barriers.txt first)
--------

The rule of thumb:

 - non-RMW operations are unordered;

 - RMW operations that have no return value are unordered;

 - RMW operations that have a return value are fully ordered;

 - RMW operations that are conditional are unordered on FAILURE,
   otherwise the above rules apply.

Except of course when a successful operation has an explicit ordering like:

 {}_relaxed: unordered
 {}_acquire: the R of the RMW (or atomic_read) is an ACQUIRE
 {}_release: the W of the RMW (or atomic_set)  is a  RELEASE

Where 'unordered' is against other memory locations. Address dependencies are
not defeated.  Conditional operations are still unordered on FAILURE.

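This FAILURE rule matters in practice. For example, a trylock built from a
conditional RMW op (a minimal sketch; the helper name is made up):

  static bool my_trylock(atomic_t *lock)
  {
    int old = 0;

    /* On success the R of the RMW is an ACQUIRE and the critical
     * section cannot leak up across it. On FAILURE, however, there is
     * no ordering whatsoever -- not even the _acquire -- so callers
     * must not rely on any ordering from a failed attempt. */
    return atomic_try_cmpxchg_acquire(lock, &old, 1);
  }
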
Fully ordered primitives are ordered against everything prior and everything
subsequent. Therefore a fully ordered primitive is like having an smp_mb()
before and an smp_mb() after the primitive.


The barriers:

  smp_mb__{before,after}_atomic()

only apply to the RMW atomic ops and can be used to augment/upgrade the
ordering inherent to the op. These barriers act almost like a full smp_mb():
smp_mb__before_atomic() orders all earlier accesses against the RMW op
itself and all accesses following it, and smp_mb__after_atomic() orders all
later accesses against the RMW op and all accesses preceding it. However,
accesses between the smp_mb__{before,after}_atomic() and the RMW op are not
ordered, so it is advisable to place the barrier right next to the RMW atomic
op whenever possible.

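For instance (a sketch; 'a' and 'v' are placeholders):

  /* Bad: the store to 'a' sits between the barrier and the RMW op and
   * is NOT ordered by the combination. */
  smp_mb__before_atomic();
  WRITE_ONCE(a, 1);
  atomic_inc(&v);

  /* Good: the barrier is immediately adjacent to the RMW op. */
  WRITE_ONCE(a, 1);
  smp_mb__before_atomic();
  atomic_inc(&v);
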
These helper barriers exist because architectures have varying implicit
ordering on their SMP atomic primitives. For example, our TSO architectures
provide fully ordered atomics and these barriers are no-ops.

NOTE: when the atomic RMW ops are fully ordered, they should also imply a
compiler barrier.

Thus:

  atomic_fetch_add();

is equivalent to:

  smp_mb__before_atomic();
  atomic_fetch_add_relaxed();
  smp_mb__after_atomic();

However, the atomic_fetch_add() might be implemented more efficiently.

Further, while something like:

  smp_mb__before_atomic();
  atomic_dec(&X);

is a 'typical' RELEASE pattern, the barrier is strictly stronger than
a RELEASE because it orders preceding instructions against both the read
and write parts of the atomic_dec(), and against all following instructions
as well. Similarly, something like:

  atomic_inc(&X);
  smp_mb__after_atomic();

is an ACQUIRE pattern (though very much not typical), but again the barrier is
strictly stronger than ACQUIRE. As illustrated:

  C Atomic-RMW+mb__after_atomic-is-stronger-than-acquire

  {
  }

  P0(int *x, atomic_t *y)
  {
    r0 = READ_ONCE(*x);
    smp_rmb();
    r1 = atomic_read(y);
  }

  P1(int *x, atomic_t *y)
  {
    atomic_inc(y);
    smp_mb__after_atomic();
    WRITE_ONCE(*x, 1);
  }

  exists
  (0:r0=1 /\ 0:r1=0)

This should not happen; but a hypothetical atomic_inc_acquire() --
(void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
because it would not order the W part of the RMW against the following
WRITE_ONCE.  Thus:

  P0			P1

			t = LL.acq *y (0)
			t++;
			*x = 1;
  r0 = *x (1)
  RMB
  r1 = *y (0)
			SC *y, t;

is allowed.

CMPXCHG vs TRY_CMPXCHG
----------------------

  int atomic_cmpxchg(atomic_t *ptr, int old, int new);
  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new);

Both provide the same functionality, but try_cmpxchg() can lead to more
compact code. The functions relate like:

  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new)
  {
    int ret, old = *oldp;
    ret = atomic_cmpxchg(ptr, old, new);
    if (ret != old)
      *oldp = ret;
    return ret == old;
  }

and:

  int atomic_cmpxchg(atomic_t *ptr, int old, int new)
  {
    (void)atomic_try_cmpxchg(ptr, &old, new);
    return old;
  }

Usage:

  old = atomic_read(&v);			old = atomic_read(&v);
  for (;;) {					do {
    new = func(old);				  new = func(old);
    tmp = atomic_cmpxchg(&v, old, new);		} while (!atomic_try_cmpxchg(&v, &old, new));
    if (tmp == old)
      break;
    old = tmp;
  }

NB. try_cmpxchg() also generates better code on some platforms (notably x86)
where the function more closely matches the hardware instruction.

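As a concrete instance of the try_cmpxchg() loop above, a saturating
increment where func(old) is 'min(old + 1, INT_MAX)' (a sketch; the helper
name is made up):

  static int sat_inc(atomic_t *v)
  {
    int old = atomic_read(v);
    int new;

    do {
      if (old == INT_MAX)
        return old;	/* already saturated, no store needed */
      new = old + 1;
      /* On failure atomic_try_cmpxchg() updates 'old' to the value
       * just read back, ready for the next iteration. */
    } while (!atomic_try_cmpxchg(v, &old, new));

    return new;
  }
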
FORWARD PROGRESS
----------------

In general strong forward progress is expected of all unconditional atomic
operations -- those in the Arithmetic and Bitwise classes and xchg(). However
a fair amount of code also requires forward progress from the conditional
atomic operations.

Specifically 'simple' cmpxchg() loops are expected to not starve one another
indefinitely. However, this is not evident on LL/SC architectures, because
while an LL/SC architecture 'can/should/must' provide forward progress
guarantees between competing LL/SC sections, such a guarantee does not
transfer to cmpxchg() implemented using LL/SC. Consider:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!atomic_try_cmpxchg(&v, &old, new));

which on LL/SC becomes something like:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!({
    asm volatile ("1: LL  %[oldval], %[v]\n"
                  "   CMP %[oldval], %[old]\n"
                  "   BNE 2f\n"
                  "   SC  %[new], %[v]\n"
                  "   BNE 1b\n"
                  "2:\n"
                  : [oldval] "=&r" (oldval), [v] "+m" (v)
                  : [old] "r" (old), [new] "r" (new)
                  : "memory");
    success = (oldval == old);
    if (!success)
      old = oldval;
    success; }));

However, even the forward branch from the failed compare can cause the LL/SC
to fail on some architectures, let alone whatever the compiler makes of the C
loop body. As a result there is no guarantee whatsoever that the cacheline
containing @v will stay on the local CPU and that progress is made.

Even native CAS architectures can fail to provide forward progress for their
primitive (see Sparc64 for an example).

Such implementations are strongly encouraged to add exponential backoff loops
to a failed CAS in order to ensure some progress. Affected architectures are
also strongly encouraged to inspect/audit the atomic fallbacks, refcount_t and
their locking primitives.
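
A minimal sketch of such a backoff (entirely illustrative -- real
implementations live in arch code, use the raw CAS primitive and tune the
constants empirically; ndelay() stands in for whatever delay mechanism the
architecture prefers):

  static void my_atomic_add(int i, atomic_t *v)
  {
    unsigned int delay = 1;
    int old = atomic_read(v);

    /* CAS loop with bounded exponential backoff on contention. */
    while (!atomic_try_cmpxchg(v, &old, old + i)) {
      /* Lost the race: back off so the winning CPU can hold on to
       * the cacheline long enough to make progress, doubling the
       * delay up to an arbitrary cap. */
      ndelay(delay);
      if (delay < 512)
        delay <<= 1;
    }
  }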