	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:
   - Dynamic tick (nohz) updates, perhaps most notably changes to force
     the tick on when needed due to lengthy in-kernel execution on CPUs
     on which RCU is waiting.
   - Linux-kernel memory consistency model updates.
   - Replace rcu_swap_protected() with rcu_replace_pointer().
   - Torture-test updates.
   - Documentation updates.
   - Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
  bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
  fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
  drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
  drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
  x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
  rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
  rcu: Suppress levelspread uninitialized messages
  rcu: Fix uninitialized variable in nocb_gp_wait()
  rcu: Update descriptions for rcu_future_grace_period tracepoint
  rcu: Update descriptions for rcu_nocb_wake tracepoint
  rcu: Remove obsolete descriptions for rcu_barrier tracepoint
  rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
  workqueue: Convert for_each_wq to use built-in list check
  rcu: Several rcu_segcblist functions can be static
  rcu: Remove unused function hlist_bl_del_init_rcu()
  Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
  ...
			
			
commit 1ae78780ed

64 changed files with 5829 additions and 6390 deletions
										
											
Documentation/RCU/Design/Data-Structures/Data-Structures.rst: 1163 lines, new file (diff suppressed because it is too large)
@@ -1,668 +0,0 @@  (removed: the old HTML version of this document)
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" | ||||
|         "http://www.w3.org/TR/html4/loose.dtd"> | ||||
|         <html> | ||||
|         <head><title>A Tour Through TREE_RCU's Expedited Grace Periods</title> | ||||
|         <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> | ||||
| 
 | ||||
| <h2>Introduction</h2> | ||||
| 
 | ||||
| This document describes RCU's expedited grace periods. | ||||
| Unlike RCU's normal grace periods, which accept long latencies to attain | ||||
| high efficiency and minimal disturbance, expedited grace periods accept | ||||
| lower efficiency and significant disturbance to attain shorter latencies. | ||||
| 
 | ||||
| <p> | ||||
| There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier | ||||
| third RCU-bh flavor having been implemented in terms of the other two. | ||||
| Each of the two implementations is covered in its own section. | ||||
| 
 | ||||
| <ol> | ||||
| <li>	<a href="#Expedited Grace Period Design"> | ||||
| 	Expedited Grace Period Design</a> | ||||
| <li>	<a href="#RCU-preempt Expedited Grace Periods"> | ||||
| 	RCU-preempt Expedited Grace Periods</a> | ||||
| <li>	<a href="#RCU-sched Expedited Grace Periods"> | ||||
| 	RCU-sched Expedited Grace Periods</a> | ||||
| <li>	<a href="#Expedited Grace Period and CPU Hotplug"> | ||||
| 	Expedited Grace Period and CPU Hotplug</a> | ||||
| <li>	<a href="#Expedited Grace Period Refinements"> | ||||
| 	Expedited Grace Period Refinements</a> | ||||
| </ol> | ||||
| 
 | ||||
| <h2><a name="Expedited Grace Period Design"> | ||||
| Expedited Grace Period Design</a></h2> | ||||
| 
 | ||||
| <p> | ||||
| The expedited RCU grace periods cannot be accused of being subtle, | ||||
| given that they for all intents and purposes hammer every CPU that | ||||
| has not yet provided a quiescent state for the current expedited | ||||
| grace period. | ||||
| The one saving grace is that the hammer has grown a bit smaller | ||||
| over time:  The old call to <tt>try_stop_cpus()</tt> has been | ||||
| replaced with a set of calls to <tt>smp_call_function_single()</tt>, | ||||
| each of which results in an IPI to the target CPU. | ||||
| The corresponding handler function checks the CPU's state, motivating | ||||
| a faster quiescent state where possible, and triggering a report | ||||
| of that quiescent state. | ||||
| As always for RCU, once everything has spent some time in a quiescent | ||||
| state, the expedited grace period has completed. | ||||
| 
 | ||||
| <p> | ||||
| The details of the <tt>smp_call_function_single()</tt> handler's | ||||
| operation depend on the RCU flavor, as described in the following | ||||
| sections. | ||||
| 
 | ||||
| <h2><a name="RCU-preempt Expedited Grace Periods"> | ||||
| RCU-preempt Expedited Grace Periods</a></h2> | ||||
| 
 | ||||
| <p> | ||||
| <tt>CONFIG_PREEMPT=y</tt> kernels implement RCU-preempt. | ||||
| The overall flow of the handling of a given CPU by an RCU-preempt | ||||
| expedited grace period is shown in the following diagram: | ||||
| 
 | ||||
| <p><img src="ExpRCUFlow.svg" alt="ExpRCUFlow.svg" width="55%"> | ||||
| 
 | ||||
| <p> | ||||
| The solid arrows denote direct action, for example, a function call. | ||||
| The dotted arrows denote indirect action, for example, an IPI | ||||
| or a state that is reached after some time. | ||||
| 
 | ||||
| <p> | ||||
| If a given CPU is offline or idle, <tt>synchronize_rcu_expedited()</tt> | ||||
| will ignore it because idle and offline CPUs are already residing | ||||
| in quiescent states. | ||||
| Otherwise, the expedited grace period will use | ||||
| <tt>smp_call_function_single()</tt> to send the CPU an IPI, which | ||||
| is handled by <tt>rcu_exp_handler()</tt>. | ||||
| 
 | ||||
| <p> | ||||
| However, because this is preemptible RCU, <tt>rcu_exp_handler()</tt> | ||||
| can check to see if the CPU is currently running in an RCU read-side | ||||
| critical section. | ||||
| If not, the handler can immediately report a quiescent state. | ||||
| Otherwise, it sets flags so that the outermost <tt>rcu_read_unlock()</tt> | ||||
| invocation will provide the needed quiescent-state report. | ||||
| This flag-setting avoids the previous forced preemption of all | ||||
| CPUs that might have RCU read-side critical sections. | ||||
| In addition, this flag-setting is done so as to avoid increasing | ||||
| the overhead of the common-case fastpath through the scheduler. | ||||
| 
 | ||||
| <p> | ||||
| Again because this is preemptible RCU, an RCU read-side critical section | ||||
| can be preempted. | ||||
| When that happens, RCU will enqueue the task, which will then continue to | ||||
| block the current expedited grace period until it resumes and finds its | ||||
| outermost <tt>rcu_read_unlock()</tt>. | ||||
| The CPU will report a quiescent state just after enqueuing the task because | ||||
| the CPU is no longer blocking the grace period. | ||||
| It is instead the preempted task doing the blocking. | ||||
| The list of blocked tasks is managed by <tt>rcu_preempt_ctxt_queue()</tt>, | ||||
| which is called from <tt>rcu_preempt_note_context_switch()</tt>, which | ||||
| in turn is called from <tt>rcu_note_context_switch()</tt>, which in | ||||
| turn is called from the scheduler. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	Why not just have the expedited grace period check the | ||||
| 	state of all the CPUs? | ||||
| 	After all, that would avoid all those real-time-unfriendly IPIs. | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	Because we want the RCU read-side critical sections to run fast, | ||||
| 	which means no memory barriers. | ||||
| 	Therefore, it is not possible to safely check the state from some | ||||
| 	other CPU. | ||||
| 	And even if it was possible to safely check the state, it would | ||||
| 	still be necessary to IPI the CPU to safely interact with the | ||||
| 	upcoming <tt>rcu_read_unlock()</tt> invocation, which means that | ||||
| 	the remote state testing would not help the worst-case | ||||
| 	latency that real-time applications care about. | ||||
| 
 | ||||
| 	<p><font color="ffffff">One way to prevent your real-time | ||||
| 	application from getting hit with these IPIs is to | ||||
| 	build your kernel with <tt>CONFIG_NO_HZ_FULL=y</tt>. | ||||
| 	RCU would then perceive the CPU running your application | ||||
| 	as being idle, and it would be able to safely detect that | ||||
| 	state without needing to IPI the CPU. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <p> | ||||
| Please note that this is just the overall flow: | ||||
| Additional complications can arise due to races with CPUs going idle | ||||
| or offline, among other things. | ||||
| 
 | ||||
| <h2><a name="RCU-sched Expedited Grace Periods"> | ||||
| RCU-sched Expedited Grace Periods</a></h2> | ||||
| 
 | ||||
| <p> | ||||
| <tt>CONFIG_PREEMPT=n</tt> kernels implement RCU-sched. | ||||
| The overall flow of the handling of a given CPU by an RCU-sched | ||||
| expedited grace period is shown in the following diagram: | ||||
| 
 | ||||
| <p><img src="ExpSchedFlow.svg" alt="ExpSchedFlow.svg" width="55%"> | ||||
| 
 | ||||
| <p> | ||||
| As with RCU-preempt, RCU-sched's | ||||
| <tt>synchronize_rcu_expedited()</tt> ignores offline and | ||||
| idle CPUs, again because they are in remotely detectable | ||||
| quiescent states. | ||||
| However, because the | ||||
| <tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> | ||||
| leave no trace of their invocation, in general it is not possible to tell | ||||
| whether or not the current CPU is in an RCU read-side critical section. | ||||
| The best that RCU-sched's <tt>rcu_exp_handler()</tt> can do is to check | ||||
| for idle, on the off-chance that the CPU went idle while the IPI | ||||
| was in flight. | ||||
| If the CPU is idle, then <tt>rcu_exp_handler()</tt> reports | ||||
| the quiescent state. | ||||
| 
 | ||||
| <p> Otherwise, the handler forces a future context switch by setting the | ||||
| NEED_RESCHED flag of the current task's thread flag and the CPU preempt | ||||
| counter. | ||||
| At the time of the context switch, the CPU reports the quiescent state. | ||||
| Should the CPU go offline first, it will report the quiescent state | ||||
| at that time. | ||||
| 
 | ||||
| <h2><a name="Expedited Grace Period and CPU Hotplug"> | ||||
| Expedited Grace Period and CPU Hotplug</a></h2> | ||||
| 
 | ||||
| <p> | ||||
| The expedited nature of expedited grace periods requires a much tighter | ||||
| interaction with CPU hotplug operations than is required for normal | ||||
| grace periods. | ||||
| In addition, attempting to IPI offline CPUs will result in splats, but | ||||
| failing to IPI online CPUs can result in too-short grace periods. | ||||
| Neither option is acceptable in production kernels. | ||||
| 
 | ||||
| <p> | ||||
| The interaction between expedited grace periods and CPU hotplug operations | ||||
| is carried out at several levels: | ||||
| 
 | ||||
| <ol> | ||||
| <li>	The number of CPUs that have ever been online is tracked | ||||
| 	by the <tt>rcu_state</tt> structure's <tt>->ncpus</tt> | ||||
| 	field. | ||||
| 	The <tt>rcu_state</tt> structure's <tt>->ncpus_snap</tt> | ||||
| 	field tracks the number of CPUs that have ever been online | ||||
| 	at the beginning of an RCU expedited grace period. | ||||
| 	Note that this number never decreases, at least in the absence | ||||
| 	of a time machine. | ||||
| <li>	The identities of the CPUs that have ever been online are | ||||
| 	tracked by the <tt>rcu_node</tt> structure's | ||||
| 	<tt>->expmaskinitnext</tt> field. | ||||
| 	The <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> | ||||
| 	field tracks the identities of the CPUs that were online | ||||
| 	at least once at the beginning of the most recent RCU | ||||
| 	expedited grace period. | ||||
| 	The <tt>rcu_state</tt> structure's <tt>->ncpus</tt> and | ||||
| 	<tt>->ncpus_snap</tt> fields are used to detect when | ||||
| 	new CPUs have come online for the first time, that is, | ||||
| 	when the <tt>rcu_node</tt> structure's <tt>->expmaskinitnext</tt> | ||||
| 	field has changed since the beginning of the last RCU | ||||
| 	expedited grace period, which triggers an update of each | ||||
| 	<tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> | ||||
| 	field from its <tt>->expmaskinitnext</tt> field. | ||||
| <li>	Each <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> | ||||
| 	field is used to initialize that structure's | ||||
| 	<tt>->expmask</tt> at the beginning of each RCU | ||||
| 	expedited grace period. | ||||
| 	This means that only those CPUs that have been online at least | ||||
| 	once will be considered for a given grace period. | ||||
| <li>	Any CPU that goes offline will clear its bit in its leaf | ||||
| 	<tt>rcu_node</tt> structure's <tt>->qsmaskinitnext</tt> | ||||
| 	field, so any CPU with that bit clear can safely be ignored. | ||||
| 	However, it is possible for a CPU coming online or going offline | ||||
| 	to have this bit set for some time while <tt>cpu_online</tt> | ||||
| 	returns <tt>false</tt>. | ||||
| <li>	For each non-idle CPU that RCU believes is currently online, the grace | ||||
| 	period invokes <tt>smp_call_function_single()</tt>. | ||||
| 	If this succeeds, the CPU was fully online. | ||||
| 	Failure indicates that the CPU is in the process of coming online | ||||
| 	or going offline, in which case it is necessary to wait for a | ||||
| 	short time period and try again. | ||||
| 	The purpose of this wait (or series of waits, as the case may be) | ||||
| 	is to permit a concurrent CPU-hotplug operation to complete. | ||||
| <li>	In the case of RCU-sched, one of the last acts of an outgoing CPU | ||||
| 	is to invoke <tt>rcu_report_dead()</tt>, which | ||||
| 	reports a quiescent state for that CPU. | ||||
| 	However, this is likely paranoia-induced redundancy. <!-- @@@ --> | ||||
| </ol> | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	Why all the dancing around with multiple counters and masks | ||||
| 	tracking CPUs that were once online? | ||||
| 	Why not just have a single set of masks tracking the currently | ||||
| 	online CPUs and be done with it? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	Maintaining a single set of masks tracking the online CPUs <i>sounds</i> | ||||
| 	easier, at least until you try working out all the race conditions | ||||
| 	between grace-period initialization and CPU-hotplug operations. | ||||
| 	For example, suppose initialization is progressing down the | ||||
| 	tree while a CPU-offline operation is progressing up the tree. | ||||
| 	This situation can result in bits set at the top of the tree | ||||
| 	that have no counterparts at the bottom of the tree. | ||||
| 	Those bits will never be cleared, which will result in | ||||
| 	grace-period hangs. | ||||
| 	In short, that way lies madness, to say nothing of a great many | ||||
| 	bugs, hangs, and deadlocks. | ||||
| 
 | ||||
| 	<p><font color="ffffff"> | ||||
| 	In contrast, the current multi-mask multi-counter scheme ensures | ||||
| 	that grace-period initialization will always see consistent masks | ||||
| 	up and down the tree, which brings significant simplifications | ||||
| 	over the single-mask method. | ||||
| 
 | ||||
| 	<p><font color="ffffff"> | ||||
| 	This is an instance of | ||||
| 	<a href="http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1992/cucs-039-92.ps.gz"><font color="ffffff"> | ||||
| 	deferring work in order to avoid synchronization</a>. | ||||
| 	Lazily recording CPU-hotplug events at the beginning of the next | ||||
| 	grace period greatly simplifies maintenance of the CPU-tracking | ||||
| 	bitmasks in the <tt>rcu_node</tt> tree. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <h2><a name="Expedited Grace Period Refinements"> | ||||
| Expedited Grace Period Refinements</a></h2> | ||||
| 
 | ||||
| <ol> | ||||
| <li>	<a href="#Idle-CPU Checks">Idle-CPU checks</a>. | ||||
| <li>	<a href="#Batching via Sequence Counter"> | ||||
| 	Batching via sequence counter</a>. | ||||
| <li>	<a href="#Funnel Locking and Wait/Wakeup"> | ||||
| 	Funnel locking and wait/wakeup</a>. | ||||
| <li>	<a href="#Use of Workqueues">Use of Workqueues</a>. | ||||
| <li>	<a href="#Stall Warnings">Stall warnings</a>. | ||||
| <li>	<a href="#Mid-Boot Operation">Mid-boot operation</a>. | ||||
| </ol> | ||||
| 
 | ||||
| <h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3> | ||||
| 
 | ||||
| <p> | ||||
| Each expedited grace period checks for idle CPUs when initially forming | ||||
| the mask of CPUs to be IPIed and again just before IPIing a CPU | ||||
| (both checks are carried out by <tt>sync_rcu_exp_select_cpus()</tt>). | ||||
| If the CPU is idle at any time between those two times, the CPU will | ||||
| not be IPIed. | ||||
| Instead, the task pushing the grace period forward will include the | ||||
| idle CPUs in the mask passed to <tt>rcu_report_exp_cpu_mult()</tt>. | ||||
| 
 | ||||
| <p> | ||||
| For RCU-sched, there is an additional check: | ||||
| If the IPI has interrupted the idle loop, then | ||||
| <tt>rcu_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt> | ||||
| to report the corresponding quiescent state. | ||||
| 
 | ||||
| <p> | ||||
| For RCU-preempt, there is no specific check for idle in the | ||||
| IPI handler (<tt>rcu_exp_handler()</tt>), but because | ||||
| RCU read-side critical sections are not permitted within the | ||||
| idle loop, if <tt>rcu_exp_handler()</tt> sees that the CPU is within | ||||
| an RCU read-side critical section, the CPU cannot possibly be idle. | ||||
| Otherwise, <tt>rcu_exp_handler()</tt> invokes | ||||
| <tt>rcu_report_exp_rdp()</tt> to report the corresponding quiescent | ||||
| state, regardless of whether or not that quiescent state was due to | ||||
| the CPU being idle. | ||||
| 
 | ||||
| <p> | ||||
| In summary, RCU expedited grace periods check for idle when building | ||||
| the bitmask of CPUs that must be IPIed, just before sending each IPI, | ||||
| and (either explicitly or implicitly) within the IPI handler. | ||||
| 
 | ||||
| <h3><a name="Batching via Sequence Counter"> | ||||
| Batching via Sequence Counter</a></h3> | ||||
| 
 | ||||
| <p> | ||||
| If each grace-period request was carried out separately, expedited | ||||
| grace periods would have abysmal scalability and | ||||
| problematic high-load characteristics. | ||||
| Because each grace-period operation can serve an unlimited number of | ||||
| updates, it is important to <i>batch</i> requests, so that a single | ||||
| expedited grace-period operation will cover all requests in the | ||||
| corresponding batch. | ||||
| 
 | ||||
| <p> | ||||
| This batching is controlled by a sequence counter named | ||||
| <tt>->expedited_sequence</tt> in the <tt>rcu_state</tt> structure. | ||||
| This counter has an odd value when there is an expedited grace period | ||||
| in progress and an even value otherwise, so that dividing the counter | ||||
| value by two gives the number of completed grace periods. | ||||
| During any given update request, the counter must transition from | ||||
| even to odd and then back to even, thus indicating that a grace | ||||
| period has elapsed. | ||||
| Therefore, if the initial value of the counter is <tt>s</tt>, | ||||
| the updater must wait until the counter reaches at least the | ||||
| value <tt>(s+3)&~0x1</tt>. | ||||
| This counter is managed by the following access functions: | ||||
| 
 | ||||
| <ol> | ||||
| <li>	<tt>rcu_exp_gp_seq_start()</tt>, which marks the start of | ||||
| 	an expedited grace period. | ||||
| <li>	<tt>rcu_exp_gp_seq_end()</tt>, which marks the end of an | ||||
| 	expedited grace period. | ||||
| <li>	<tt>rcu_exp_gp_seq_snap()</tt>, which obtains a snapshot of | ||||
| 	the counter. | ||||
| <li>	<tt>rcu_exp_gp_seq_done()</tt>, which returns <tt>true</tt> | ||||
| 	if a full expedited grace period has elapsed since the | ||||
| 	corresponding call to <tt>rcu_exp_gp_seq_snap()</tt>. | ||||
| </ol> | ||||
| 
 | ||||
| <p> | ||||
| Again, only one request in a given batch need actually carry out | ||||
| a grace-period operation, which means there must be an efficient | ||||
| way to identify which of many concurrent requests will initiate | ||||
| the grace period, and that there be an efficient way for the | ||||
| remaining requests to wait for that grace period to complete. | ||||
| However, that is the topic of the next section. | ||||
| 
 | ||||
| <h3><a name="Funnel Locking and Wait/Wakeup"> | ||||
| Funnel Locking and Wait/Wakeup</a></h3> | ||||
| 
 | ||||
| <p> | ||||
| The natural way to sort out which of a batch of updaters will initiate | ||||
| the expedited grace period is to use the <tt>rcu_node</tt> combining | ||||
| tree, as implemented by the <tt>exp_funnel_lock()</tt> function. | ||||
| The first updater corresponding to a given grace period arriving | ||||
| at a given <tt>rcu_node</tt> structure records its desired grace-period | ||||
| sequence number in the <tt>->exp_seq_rq</tt> field and moves up | ||||
| to the next level in the tree. | ||||
| Otherwise, if the <tt>->exp_seq_rq</tt> field already contains | ||||
| the sequence number for the desired grace period or some later one, | ||||
| the updater blocks on one of four wait queues in the | ||||
| <tt>->exp_wq[]</tt> array, using the second-from-bottom | ||||
| and third-from-bottom bits as an index. | ||||
| An <tt>->exp_lock</tt> field in the <tt>rcu_node</tt> structure | ||||
| synchronizes access to these fields. | ||||
| 
 | ||||
| <p> | ||||
| An empty <tt>rcu_node</tt> tree is shown in the following diagram, | ||||
| with the white cells representing the <tt>->exp_seq_rq</tt> field | ||||
| and the red cells representing the elements of the | ||||
| <tt>->exp_wq[]</tt> array. | ||||
| 
 | ||||
| <p><img src="Funnel0.svg" alt="Funnel0.svg" width="75%"> | ||||
| 
 | ||||
| <p> | ||||
| The next diagram shows the situation after the arrival of Task A | ||||
| and Task B at the leftmost and rightmost leaf <tt>rcu_node</tt> | ||||
| structures, respectively. | ||||
| The current value of the <tt>rcu_state</tt> structure's | ||||
| <tt>->expedited_sequence</tt> field is zero, so adding three and | ||||
| clearing the bottom bit results in the value two, which both tasks | ||||
| record in the <tt>->exp_seq_rq</tt> field of their respective | ||||
| <tt>rcu_node</tt> structures: | ||||
| 
 | ||||
| <p><img src="Funnel1.svg" alt="Funnel1.svg" width="75%"> | ||||
| 
 | ||||
| <p> | ||||
| Each of Tasks A and B will move up to the root | ||||
| <tt>rcu_node</tt> structure. | ||||
| Suppose that Task A wins, recording its desired grace-period sequence | ||||
| number and resulting in the state shown below: | ||||
| 
 | ||||
| <p><img src="Funnel2.svg" alt="Funnel2.svg" width="75%"> | ||||
| 
 | ||||
| <p> | ||||
| Task A now advances to initiate a new grace period, while Task B | ||||
| moves up to the root <tt>rcu_node</tt> structure, and, seeing that | ||||
| its desired sequence number is already recorded, blocks on | ||||
| <tt>->exp_wq[1]</tt>. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	Why <tt>->exp_wq[1]</tt>? | ||||
| 	Given that the value of these tasks' desired sequence number is | ||||
| 	two, shouldn't they instead block on <tt>->exp_wq[2]</tt>? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	No. | ||||
| 
 | ||||
| 	<p><font color="ffffff"> | ||||
| 	Recall that the bottom bit of the desired sequence number indicates | ||||
| 	whether or not a grace period is currently in progress. | ||||
| 	It is therefore necessary to shift the sequence number right one | ||||
| 	bit position to obtain the number of the grace period. | ||||
| 	This results in <tt>->exp_wq[1]</tt>. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <p> | ||||
| If Tasks C and D also arrive at this point, they will compute the | ||||
| same desired grace-period sequence number, and see that both leaf | ||||
| <tt>rcu_node</tt> structures already have that value recorded. | ||||
| They will therefore block on their respective <tt>rcu_node</tt> | ||||
| structures' <tt>->exp_wq[1]</tt> fields, as shown below: | ||||
| 
 | ||||
| <p><img src="Funnel3.svg" alt="Funnel3.svg" width="75%"> | ||||
| 
 | ||||
| <p> | ||||
| Task A now acquires the <tt>rcu_state</tt> structure's | ||||
| <tt>->exp_mutex</tt> and initiates the grace period, which | ||||
| increments <tt>->expedited_sequence</tt>. | ||||
| Therefore, if Tasks E and F arrive, they will compute | ||||
| a desired sequence number of 4 and will record this value as | ||||
| shown below: | ||||
| 
 | ||||
| <p><img src="Funnel4.svg" alt="Funnel4.svg" width="75%"> | ||||
| 
 | ||||
| <p> | ||||
| Tasks E and F will propagate up the <tt>rcu_node</tt> | ||||
| combining tree, with Task F blocking on the root <tt>rcu_node</tt> | ||||
| structure and Task E waiting for Task A to finish so that | ||||
| it can start the next grace period. | ||||
| The resulting state is as shown below: | ||||
| 
 | ||||
| <p><img src="Funnel5.svg" alt="Funnel5.svg" width="75%"> | ||||
| 
 | ||||
| <p> | ||||
| Once the grace period completes, Task A | ||||
| starts waking up the tasks waiting for this grace period to complete, | ||||
| increments the <tt>->expedited_sequence</tt>, | ||||
| acquires the <tt>->exp_wake_mutex</tt> and then releases the | ||||
| <tt>->exp_mutex</tt>. | ||||
| This results in the following state: | ||||
| 
 | ||||
| <p><img src="Funnel6.svg" alt="Funnel6.svg" width="75%"> | ||||
| 
 | ||||
| <p> | ||||
| Task E can then acquire <tt>->exp_mutex</tt> and increment | ||||
| <tt>->expedited_sequence</tt> to the value three. | ||||
| If new tasks G and H arrive and move up the combining tree at the | ||||
| same time, the state will be as follows: | ||||
| 
 | ||||
| <p><img src="Funnel7.svg" alt="Funnel7.svg" width="75%"> | ||||
| 
 | ||||
| <p> | ||||
| Note that three of the root <tt>rcu_node</tt> structure's | ||||
| waitqueues are now occupied. | ||||
| However, at some point, Task A will wake up the | ||||
| tasks blocked on the <tt>->exp_wq</tt> waitqueues, resulting | ||||
| in the following state: | ||||
| 
 | ||||
| <p><img src="Funnel8.svg" alt="Funnel8.svg" width="75%"> | ||||
| 
 | ||||
| <p> | ||||
| Execution will continue with Tasks E and H completing | ||||
| their grace periods and carrying out their wakeups. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	What happens if Task A takes so long to do its wakeups | ||||
| 	that Task E's grace period completes? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	Then Task E will block on the <tt>->exp_wake_mutex</tt>, | ||||
| 	which will also prevent it from releasing <tt>->exp_mutex</tt>, | ||||
| 	which in turn will prevent the next grace period from starting. | ||||
| 	This last is important in preventing overflow of the | ||||
| 	<tt>->exp_wq[]</tt> array. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <h3><a name="Use of Workqueues">Use of Workqueues</a></h3> | ||||
| 
 | ||||
| <p> | ||||
| In earlier implementations, the task requesting the expedited | ||||
| grace period also drove it to completion. | ||||
| This straightforward approach had the disadvantage of needing to | ||||
| account for POSIX signals sent to user tasks, | ||||
| so more recent implementations use the Linux kernel's | ||||
| <a href="https://www.kernel.org/doc/Documentation/core-api/workqueue.rst">workqueues</a>. | ||||
| 
 | ||||
| <p> | ||||
| The requesting task still does counter snapshotting and funnel-lock | ||||
| processing, but the task reaching the top of the funnel lock | ||||
| does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>) | ||||
| so that a workqueue kthread does the actual grace-period processing. | ||||
| Because workqueue kthreads do not accept POSIX signals, grace-period-wait | ||||
| processing need not allow for POSIX signals. | ||||
| 
 | ||||
| In addition, this approach allows wakeups for the previous expedited | ||||
| grace period to be overlapped with processing for the next expedited | ||||
| grace period. | ||||
| Because there are only four sets of waitqueues, it is necessary to | ||||
| ensure that the previous grace period's wakeups complete before the | ||||
| next grace period's wakeups start. | ||||
| This is handled by having the <tt>->exp_mutex</tt> | ||||
| guard expedited grace-period processing and the | ||||
| <tt>->exp_wake_mutex</tt> guard wakeups. | ||||
| The key point is that the <tt>->exp_mutex</tt> is not released | ||||
| until the first wakeup is complete, which means that the | ||||
| <tt>->exp_wake_mutex</tt> has already been acquired at that point. | ||||
| This approach ensures that the previous grace period's wakeups can | ||||
| be carried out while the current grace period is in process, but | ||||
| that these wakeups will complete before the next grace period starts. | ||||
| This means that only three waitqueues are required, guaranteeing that | ||||
| the four that are provided are sufficient. | ||||
| 
 | ||||
| <h3><a name="Stall Warnings">Stall Warnings</a></h3> | ||||
| 
 | ||||
| <p> | ||||
| Expediting grace periods does nothing to speed things up when RCU | ||||
| readers take too long, and therefore expedited grace periods check | ||||
| for stalls just as normal grace periods do. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	But why not just let the normal grace-period machinery | ||||
| 	detect the stalls, given that a given reader must block | ||||
| 	both normal and expedited grace periods? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	Because it is quite possible that at a given time there | ||||
| 	is no normal grace period in progress, in which case the | ||||
| 	normal grace period cannot emit a stall warning. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| The <tt>synchronize_sched_expedited_wait()</tt> function loops waiting | ||||
| for the expedited grace period to end, but with a timeout set to the | ||||
| current RCU CPU stall-warning time. | ||||
| If this time is exceeded, any CPUs or <tt>rcu_node</tt> structures | ||||
| blocking the current grace period are printed. | ||||
| Each stall warning results in another pass through the loop, but the | ||||
| second and subsequent passes use longer stall times. | ||||
| 
 | ||||
| <h3><a name="Mid-Boot Operation">Mid-boot operation</a></h3> | ||||
| 
 | ||||
| <p> | ||||
| The use of workqueues has the advantage that the expedited | ||||
| grace-period code need not worry about POSIX signals. | ||||
| Unfortunately, it has the | ||||
| corresponding disadvantage that workqueues cannot be used until | ||||
| they are initialized, which does not happen until some time after | ||||
| the scheduler spawns the first task. | ||||
| Given that there are parts of the kernel that really do want to | ||||
| execute grace periods during this mid-boot “dead zone”, | ||||
| expedited grace periods must do something else during this time. | ||||
| 
 | ||||
| <p> | ||||
| What they do is to fall back to the old practice of requiring that the | ||||
| requesting task drive the expedited grace period, as was the case | ||||
| before the use of workqueues. | ||||
| However, the requesting task is only required to drive the grace period | ||||
| during the mid-boot dead zone. | ||||
| Before mid-boot, a synchronous grace period is a no-op. | ||||
| Some time after mid-boot, workqueues are used. | ||||
| 
 | ||||
| <p> | ||||
| Non-expedited non-SRCU synchronous grace periods must also operate | ||||
| normally during mid-boot. | ||||
| This is handled by causing non-expedited grace periods to take the | ||||
| expedited code path during mid-boot. | ||||
| 
 | ||||
| <p> | ||||
| The current code assumes that there are no POSIX signals during | ||||
| the mid-boot dead zone. | ||||
| However, if an overwhelming need for POSIX signals somehow arises, | ||||
| appropriate adjustments can be made to the expedited stall-warning code. | ||||
| One such adjustment would reinstate the pre-workqueue stall-warning | ||||
| checks, but only during the mid-boot dead zone. | ||||
| 
 | ||||
| <p> | ||||
| With this refinement, synchronous grace periods can now be used from | ||||
| task context pretty much any time during the life of the kernel. | ||||
| That is, aside from some points in the suspend, hibernate, or shutdown | ||||
| code path. | ||||
| 
 | ||||
| <h3><a name="Summary"> | ||||
| Summary</a></h3> | ||||
| 
 | ||||
| <p> | ||||
| Expedited grace periods use a sequence-number approach to promote | ||||
| batching, so that a single grace-period operation can serve numerous | ||||
| requests. | ||||
| A funnel lock is used to efficiently identify the one task out of | ||||
| a concurrent group that will request the grace period. | ||||
| All members of the group will block on waitqueues provided in | ||||
| the <tt>rcu_node</tt> structure. | ||||
| The actual grace-period processing is carried out by a workqueue. | ||||
| 
 | ||||
| <p> | ||||
| CPU-hotplug operations are noted lazily in order to prevent the need | ||||
| for tight synchronization between expedited grace periods and | ||||
| CPU-hotplug operations. | ||||
| The dyntick-idle counters are used to avoid sending IPIs to idle CPUs, | ||||
| at least in the common case. | ||||
| RCU-preempt and RCU-sched use different IPI handlers and different | ||||
| code to respond to the state changes carried out by those handlers, | ||||
| but otherwise use common code. | ||||
| 
 | ||||
| <p> | ||||
| Quiescent states are tracked using the <tt>rcu_node</tt> tree, | ||||
| and once all necessary quiescent states have been reported, | ||||
| all tasks waiting on this expedited grace period are awakened. | ||||
| A pair of mutexes are used to allow one grace period's wakeups | ||||
| to proceed concurrently with the next grace period's processing. | ||||
| 
 | ||||
| <p> | ||||
| This combination of mechanisms allows expedited grace periods to | ||||
| run reasonably efficiently. | ||||
| However, for non-time-critical tasks, normal grace periods should be | ||||
| used instead because their longer duration permits much higher | ||||
| degrees of batching, and thus much lower per-request overheads. | ||||
| 
 | ||||
| </body></html> | ||||
|  | @ -0,0 +1,521 @@ | |||
| ================================================= | ||||
| A Tour Through TREE_RCU's Expedited Grace Periods | ||||
| ================================================= | ||||
| 
 | ||||
| Introduction | ||||
| ============ | ||||
| 
 | ||||
| This document describes RCU's expedited grace periods. | ||||
| Unlike RCU's normal grace periods, which accept long latencies to attain | ||||
| high efficiency and minimal disturbance, expedited grace periods accept | ||||
| lower efficiency and significant disturbance to attain shorter latencies. | ||||
| 
 | ||||
| There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier | ||||
| third RCU-bh flavor having been implemented in terms of the other two. | ||||
| Each of the two implementations is covered in its own section. | ||||
| 
 | ||||
| Expedited Grace Period Design | ||||
| ============================= | ||||
| 
 | ||||
| The expedited RCU grace periods cannot be accused of being subtle, | ||||
| given that they for all intents and purposes hammer every CPU that | ||||
| has not yet provided a quiescent state for the current expedited | ||||
| grace period. | ||||
| The one saving grace is that the hammer has grown a bit smaller | ||||
| over time:  The old call to ``try_stop_cpus()`` has been | ||||
| replaced with a set of calls to ``smp_call_function_single()``, | ||||
| each of which results in an IPI to the target CPU. | ||||
| The corresponding handler function checks the CPU's state, motivating | ||||
| a faster quiescent state where possible, and triggering a report | ||||
| of that quiescent state. | ||||
| As always for RCU, once everything has spent some time in a quiescent | ||||
| state, the expedited grace period has completed. | ||||
| 
 | ||||
| The details of the ``smp_call_function_single()`` handler's | ||||
| operation depend on the RCU flavor, as described in the following | ||||
| sections. | ||||
| 
 | ||||
| RCU-preempt Expedited Grace Periods | ||||
| =================================== | ||||
| 
 | ||||
| ``CONFIG_PREEMPT=y`` kernels implement RCU-preempt. | ||||
| The overall flow of the handling of a given CPU by an RCU-preempt | ||||
| expedited grace period is shown in the following diagram: | ||||
| 
 | ||||
| .. kernel-figure:: ExpRCUFlow.svg | ||||
| 
 | ||||
| The solid arrows denote direct action, for example, a function call. | ||||
| The dotted arrows denote indirect action, for example, an IPI | ||||
| or a state that is reached after some time. | ||||
| 
 | ||||
| If a given CPU is offline or idle, ``synchronize_rcu_expedited()`` | ||||
| will ignore it because idle and offline CPUs are already residing | ||||
| in quiescent states. | ||||
| Otherwise, the expedited grace period will use | ||||
| ``smp_call_function_single()`` to send the CPU an IPI, which | ||||
| is handled by ``rcu_exp_handler()``. | ||||
| 
 | ||||
| However, because this is preemptible RCU, ``rcu_exp_handler()`` | ||||
| can check to see if the CPU is currently running in an RCU read-side | ||||
| critical section. | ||||
| If not, the handler can immediately report a quiescent state. | ||||
| Otherwise, it sets flags so that the outermost ``rcu_read_unlock()`` | ||||
| invocation will provide the needed quiescent-state report. | ||||
| This flag-setting avoids the previous forced preemption of all | ||||
| CPUs that might have RCU read-side critical sections. | ||||
| In addition, this flag-setting is done so as to avoid increasing | ||||
| the overhead of the common-case fastpath through the scheduler. | ||||
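
The decision just described can be modeled by the following standalone C
sketch. It is illustrative only: it is not the kernel's
``rcu_exp_handler()``, and all of the names and types are simplified
stand-ins for per-CPU and per-task state::

    #include <stdbool.h>
    #include <stdio.h>

    /* Simplified stand-in for per-CPU/per-task RCU state. */
    struct cpu_model {
        int  read_nesting;   /* depth of rcu_read_lock() nesting */
        bool deferred_qs;    /* set by the "IPI", consumed at unlock time */
        bool qs_reported;
    };

    /* Model of the expedited IPI handler on a preemptible-RCU kernel. */
    static void exp_handler(struct cpu_model *c)
    {
        if (c->read_nesting == 0)
            c->qs_reported = true;  /* not in a reader: report immediately */
        else
            c->deferred_qs = true;  /* defer to the outermost unlock */
    }

    /* Model of the outermost rcu_read_unlock() noticing the deferred request. */
    static void outermost_read_unlock(struct cpu_model *c)
    {
        c->read_nesting = 0;
        if (c->deferred_qs) {
            c->deferred_qs = false;
            c->qs_reported = true;
        }
    }

    int main(void)
    {
        struct cpu_model c = { .read_nesting = 1 };

        exp_handler(&c);              /* CPU is in a reader: QS deferred */
        outermost_read_unlock(&c);    /* ...and reported here instead */
        printf("qs_reported=%d\n", c.qs_reported);
        return 0;
    }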
| 
 | ||||
| Again because this is preemptible RCU, an RCU read-side critical section | ||||
| can be preempted. | ||||
| When that happens, RCU will enqueue the task, which will then continue to | ||||
| block the current expedited grace period until it resumes and finds its | ||||
| outermost ``rcu_read_unlock()``. | ||||
| The CPU will report a quiescent state just after enqueuing the task because | ||||
| the CPU is no longer blocking the grace period. | ||||
| It is instead the preempted task doing the blocking. | ||||
| The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``, | ||||
| which is called from ``rcu_preempt_note_context_switch()``, which | ||||
| in turn is called from ``rcu_note_context_switch()``, which in | ||||
| turn is called from the scheduler. | ||||
| 
 | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | Why not just have the expedited grace period check the state of all   | | ||||
| | the CPUs? After all, that would avoid all those real-time-unfriendly  | | ||||
| | IPIs.                                                                 | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | Because we want the RCU read-side critical sections to run fast,      | | ||||
| | which means no memory barriers. Therefore, it is not possible to      | | ||||
| | safely check the state from some other CPU. And even if it was        | | ||||
| | possible to safely check the state, it would still be necessary to    | | ||||
| | IPI the CPU to safely interact with the upcoming                      | | ||||
| | ``rcu_read_unlock()`` invocation, which means that the remote state   | | ||||
| | testing would not help the worst-case latency that real-time          | | ||||
| | applications care about.                                              | | ||||
| |                                                                       | | ||||
| | One way to prevent your real-time application from getting hit with   | | ||||
| | these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU  | | ||||
| | would then perceive the CPU running your application as being idle,   | | ||||
| | and it would be able to safely detect that state without needing to   | | ||||
| | IPI the CPU.                                                          | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| Please note that this is just the overall flow: Additional complications | ||||
| can arise due to races with CPUs going idle or offline, among other | ||||
| things. | ||||
| 
 | ||||
| RCU-sched Expedited Grace Periods | ||||
| ================================= | ||||
| 
 | ||||
| ``CONFIG_PREEMPT=n`` kernels implement RCU-sched. The overall flow of | ||||
| the handling of a given CPU by an RCU-sched expedited grace period is | ||||
| shown in the following diagram: | ||||
| 
 | ||||
| .. kernel-figure:: ExpSchedFlow.svg | ||||
| 
 | ||||
| As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores | ||||
| offline and idle CPUs, again because they are in remotely detectable | ||||
| quiescent states. However, because the ``rcu_read_lock_sched()`` and | ||||
| ``rcu_read_unlock_sched()`` leave no trace of their invocation, in | ||||
| general it is not possible to tell whether or not the current CPU is in | ||||
| an RCU read-side critical section. The best that RCU-sched's | ||||
| ``rcu_exp_handler()`` can do is to check for idle, on the off-chance | ||||
| that the CPU went idle while the IPI was in flight. If the CPU is idle, | ||||
| then ``rcu_exp_handler()`` reports the quiescent state. | ||||
| 
 | ||||
| Otherwise, the handler forces a future context switch by setting the | ||||
| NEED_RESCHED flag of the current task's thread flag and the CPU preempt | ||||
| counter. At the time of the context switch, the CPU reports the | ||||
| quiescent state. Should the CPU go offline first, it will report the | ||||
| quiescent state at that time. | ||||
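
A correspondingly simplified sketch of the RCU-sched case follows. Again,
this is not the kernel's handler (the real code manipulates the
NEED_RESCHED thread flag and the preempt counter); it merely models the
two outcomes described above: an idle CPU is reported immediately, and a
busy CPU is forced through a context switch that later supplies the
quiescent state::

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative CPU state; the kernel tracks this via dynticks counters,
     * TIF_NEED_RESCHED, and the preempt counter. */
    struct cpu_model {
        bool idle;
        bool need_resched;   /* stands in for the NEED_RESCHED thread flag */
        bool qs_reported;
    };

    /* Model of the expedited IPI handler on a CONFIG_PREEMPT=n kernel. */
    static void exp_handler_sched(struct cpu_model *c)
    {
        if (c->idle)
            c->qs_reported = true;   /* idle CPUs are already quiescent */
        else
            c->need_resched = true;  /* force a future context switch */
    }

    /* Model of the context switch that eventually follows. */
    static void context_switch(struct cpu_model *c)
    {
        if (c->need_resched) {
            c->need_resched = false;
            c->qs_reported = true;   /* the switch itself is the QS */
        }
    }

    int main(void)
    {
        struct cpu_model c = { .idle = false };

        exp_handler_sched(&c);
        context_switch(&c);
        printf("qs_reported=%d\n", c.qs_reported);
        return 0;
    }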
| 
 | ||||
| Expedited Grace Period and CPU Hotplug | ||||
| ====================================== | ||||
| 
 | ||||
| The expedited nature of expedited grace periods requires a much tighter | ||||
| interaction with CPU hotplug operations than is required for normal | ||||
| grace periods. In addition, attempting to IPI offline CPUs will result | ||||
| in splats, but failing to IPI online CPUs can result in too-short grace | ||||
| periods. Neither option is acceptable in production kernels. | ||||
| 
 | ||||
| The interaction between expedited grace periods and CPU hotplug | ||||
| operations is carried out at several levels: | ||||
| 
 | ||||
| #. The number of CPUs that have ever been online is tracked by the | ||||
|    ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state`` | ||||
|    structure's ``->ncpus_snap`` field tracks the number of CPUs that | ||||
|    have ever been online at the beginning of an RCU expedited grace | ||||
|    period. Note that this number never decreases, at least in the | ||||
|    absence of a time machine. | ||||
| #. The identities of the CPUs that have ever been online are tracked by | ||||
|    the ``rcu_node`` structure's ``->expmaskinitnext`` field. The | ||||
|    ``rcu_node`` structure's ``->expmaskinit`` field tracks the | ||||
|    identities of the CPUs that were online at least once at the | ||||
|    beginning of the most recent RCU expedited grace period. The | ||||
|    ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are | ||||
|    used to detect when new CPUs have come online for the first time, | ||||
|    that is, when the ``rcu_node`` structure's ``->expmaskinitnext`` | ||||
|    field has changed since the beginning of the last RCU expedited grace | ||||
|    period, which triggers an update of each ``rcu_node`` structure's | ||||
|    ``->expmaskinit`` field from its ``->expmaskinitnext`` field. | ||||
| #. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to | ||||
|    initialize that structure's ``->expmask`` at the beginning of each | ||||
|    RCU expedited grace period. This means that only those CPUs that have | ||||
|    been online at least once will be considered for a given grace | ||||
|    period. | ||||
| #. Any CPU that goes offline will clear its bit in its leaf ``rcu_node`` | ||||
|    structure's ``->qsmaskinitnext`` field, so any CPU with that bit | ||||
|    clear can safely be ignored. However, it is possible for a CPU coming | ||||
|    online or going offline to have this bit set for some time while | ||||
|    ``cpu_online`` returns ``false``. | ||||
| #. For each non-idle CPU that RCU believes is currently online, the | ||||
|    grace period invokes ``smp_call_function_single()``. If this | ||||
|    succeeds, the CPU was fully online. Failure indicates that the CPU is | ||||
|    in the process of coming online or going offline, in which case it is | ||||
|    necessary to wait for a short time period and try again. The purpose | ||||
|    of this wait (or series of waits, as the case may be) is to permit a | ||||
|    concurrent CPU-hotplug operation to complete. | ||||
| #. In the case of RCU-sched, one of the last acts of an outgoing CPU is | ||||
|    to invoke ``rcu_report_dead()``, which reports a quiescent state for | ||||
|    that CPU. However, this is likely paranoia-induced redundancy. | ||||
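
The interplay of the ``->ncpus``/``->ncpus_snap`` and
``->expmaskinit``/``->expmaskinitnext`` fields in the list above can be
sketched as follows. This is a loose, standalone model, not the kernel's
grace-period initialization code, and the field types are simplified::

    #include <stdio.h>

    /* Simplified stand-ins for rcu_state and rcu_node fields. */
    struct state_model { int ncpus, ncpus_snap; };
    struct node_model  { unsigned long expmaskinit, expmaskinitnext, expmask; };

    static void exp_gp_init(struct state_model *s, struct node_model *rnp)
    {
        /* A CPU came online for the first time since the last snapshot... */
        if (s->ncpus != s->ncpus_snap) {
            s->ncpus_snap = s->ncpus;
            /* ...so fold the newly-online CPUs into ->expmaskinit. */
            rnp->expmaskinit = rnp->expmaskinitnext;
        }
        /* Only CPUs that have been online at least once are waited on. */
        rnp->expmask = rnp->expmaskinit;
    }

    int main(void)
    {
        struct state_model s = { .ncpus = 3, .ncpus_snap = 2 };
        struct node_model rnp = { .expmaskinit = 0x3, .expmaskinitnext = 0x7 };

        exp_gp_init(&s, &rnp);                  /* CPU 2 is now accounted for */
        printf("expmask=0x%lx\n", rnp.expmask); /* prints 0x7 */
        return 0;
    }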
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | Why all the dancing around with multiple counters and masks tracking  | | ||||
| | CPUs that were once online? Why not just have a single set of masks   | | ||||
| | tracking the currently online CPUs and be done with it?               | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | Maintaining a single set of masks tracking the online CPUs *sounds*   | | ||||
| | easier, at least until you try working out all the race conditions    | | ||||
| | between grace-period initialization and CPU-hotplug operations. For   | | ||||
| | example, suppose initialization is progressing down the tree while a  | | ||||
| | CPU-offline operation is progressing up the tree. This situation can  | | ||||
| | result in bits set at the top of the tree that have no counterparts   | | ||||
| | at the bottom of the tree. Those bits will never be cleared, which    | | ||||
| | will result in grace-period hangs. In short, that way lies madness,   | | ||||
| | to say nothing of a great many bugs, hangs, and deadlocks.            | | ||||
| | In contrast, the current multi-mask multi-counter scheme ensures that | | ||||
| | grace-period initialization will always see consistent masks up and   | | ||||
| | down the tree, which brings significant simplifications over the      | | ||||
| | single-mask method.                                                   | | ||||
| |                                                                       | | ||||
| | This is an instance of `deferring work in order to avoid              | | ||||
| | synchronization <http://www.cs.columbia.edu/~library/TR-repository/re | | ||||
| | ports/reports-1992/cucs-039-92.ps.gz>`__.                             | | ||||
| | Lazily recording CPU-hotplug events at the beginning of the next      | | ||||
| | grace period greatly simplifies maintenance of the CPU-tracking       | | ||||
| | bitmasks in the ``rcu_node`` tree.                                    | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| Expedited Grace Period Refinements | ||||
| ================================== | ||||
| 
 | ||||
| Idle-CPU Checks | ||||
| ~~~~~~~~~~~~~~~ | ||||
| 
 | ||||
| Each expedited grace period checks for idle CPUs when initially forming | ||||
| the mask of CPUs to be IPIed and again just before IPIing a CPU (both | ||||
| checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is | ||||
| idle at any time between those two times, the CPU will not be IPIed. | ||||
| Instead, the task pushing the grace period forward will include the idle | ||||
| CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``. | ||||
| 
 | ||||
| For RCU-sched, there is an additional check: If the IPI has interrupted | ||||
| the idle loop, then ``rcu_exp_handler()`` invokes | ||||
| ``rcu_report_exp_rdp()`` to report the corresponding quiescent state. | ||||
| 
 | ||||
| For RCU-preempt, there is no specific check for idle in the IPI handler | ||||
| (``rcu_exp_handler()``), but because RCU read-side critical sections are | ||||
| not permitted within the idle loop, if ``rcu_exp_handler()`` sees that | ||||
| the CPU is within an RCU read-side critical section, the CPU cannot | ||||
| possibly be idle. Otherwise, ``rcu_exp_handler()`` invokes | ||||
| ``rcu_report_exp_rdp()`` to report the corresponding quiescent state, | ||||
| regardless of whether or not that quiescent state was due to the CPU | ||||
| being idle. | ||||
| 
 | ||||
| In summary, RCU expedited grace periods check for idle when building the | ||||
| bitmask of CPUs that must be IPIed, just before sending each IPI, and | ||||
| (either explicitly or implicitly) within the IPI handler. | ||||
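
A rough standalone model of the two idle checks described at the start of
this subsection is shown below. Neither the function names nor the idle
test match the kernel's code, which consults the dynticks counters; the
point is only the mask-building and re-check structure::

    #include <stdbool.h>
    #include <stdio.h>

    #define NCPUS 4

    /* Toy stand-in for the dynticks-based idle test. */
    static bool cpu_idle_now(int cpu)
    {
        return cpu & 1;
    }

    int main(void)
    {
        unsigned long ipi_mask = 0, idle_mask = 0;
        int cpu;

        /* First check: while building the mask of CPUs to IPI. */
        for (cpu = 0; cpu < NCPUS; cpu++) {
            if (cpu_idle_now(cpu))
                idle_mask |= 1UL << cpu;
            else
                ipi_mask |= 1UL << cpu;
        }

        /* Second check: just before each IPI would be sent. */
        for (cpu = 0; cpu < NCPUS; cpu++) {
            if (!(ipi_mask & (1UL << cpu)))
                continue;
            if (cpu_idle_now(cpu)) {        /* went idle in the meantime */
                ipi_mask &= ~(1UL << cpu);
                idle_mask |= 1UL << cpu;
            }
            /* otherwise the kernel would IPI this CPU here */
        }

        /* idle_mask is what would be handed to rcu_report_exp_cpu_mult(). */
        printf("IPI mask 0x%lx, idle mask 0x%lx\n", ipi_mask, idle_mask);
        return 0;
    }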
| 
 | ||||
| Batching via Sequence Counter | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| 
 | ||||
| If each grace-period request was carried out separately, expedited grace | ||||
| periods would have abysmal scalability and problematic high-load | ||||
| characteristics. Because each grace-period operation can serve an | ||||
| unlimited number of updates, it is important to *batch* requests, so | ||||
| that a single expedited grace-period operation will cover all requests | ||||
| in the corresponding batch. | ||||
| 
 | ||||
| This batching is controlled by a sequence counter named | ||||
| ``->expedited_sequence`` in the ``rcu_state`` structure. This counter | ||||
| has an odd value when there is an expedited grace period in progress and | ||||
| an even value otherwise, so that dividing the counter value by two gives | ||||
| the number of completed grace periods. During any given update request, | ||||
| the counter must transition from even to odd and then back to even, thus | ||||
| indicating that a grace period has elapsed. Therefore, if the initial | ||||
| value of the counter is ``s``, the updater must wait until the counter | ||||
| reaches at least the value ``(s+3)&~0x1``. This counter is managed by | ||||
| the following access functions: | ||||
| 
 | ||||
| #. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited | ||||
|    grace period. | ||||
| #. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace | ||||
|    period. | ||||
| #. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter. | ||||
| #. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited | ||||
|    grace period has elapsed since the corresponding call to | ||||
|    ``rcu_exp_gp_seq_snap()``. | ||||
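
The counter arithmetic and the four access functions listed above can be
illustrated with the following standalone sketch. It models only the
arithmetic described in this section and omits the memory ordering,
locking, and wrap handling that the kernel's helpers provide::

    #include <stdio.h>

    static unsigned long exp_seq;

    static void exp_gp_seq_start(void) { exp_seq++; }   /* even -> odd */
    static void exp_gp_seq_end(void)   { exp_seq++; }   /* odd -> even */

    /* Value the counter must reach before a request made now is satisfied. */
    static unsigned long exp_gp_seq_snap(void)
    {
        return (exp_seq + 3) & ~0x1UL;
    }

    /* Has a full expedited grace period elapsed since the snapshot? */
    static int exp_gp_seq_done(unsigned long snap)
    {
        return exp_seq >= snap;
    }

    int main(void)
    {
        unsigned long snap = exp_gp_seq_snap();   /* exp_seq == 0 -> snap == 2 */

        exp_gp_seq_start();                       /* exp_seq == 1: GP running  */
        printf("done? %d\n", exp_gp_seq_done(snap));  /* 0: not yet            */
        exp_gp_seq_end();                         /* exp_seq == 2: GP complete */
        printf("done? %d\n", exp_gp_seq_done(snap));  /* 1: request satisfied  */
        return 0;
    }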
| 
 | ||||
| Again, only one request in a given batch need actually carry out a | ||||
| grace-period operation, which means there must be an efficient way to | ||||
| identify which of many concurrent requests will initiate the grace | ||||
| period, and that there be an efficient way for the remaining requests to | ||||
| wait for that grace period to complete. However, that is the topic of | ||||
| the next section. | ||||
| 
 | ||||
| Funnel Locking and Wait/Wakeup | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| 
 | ||||
| The natural way to sort out which of a batch of updaters will initiate | ||||
| the expedited grace period is to use the ``rcu_node`` combining tree, as | ||||
| implemented by the ``exp_funnel_lock()`` function. The first updater | ||||
| corresponding to a given grace period arriving at a given ``rcu_node`` | ||||
| structure records its desired grace-period sequence number in the | ||||
| ``->exp_seq_rq`` field and moves up to the next level in the tree. | ||||
| Otherwise, if the ``->exp_seq_rq`` field already contains the sequence | ||||
| number for the desired grace period or some later one, the updater | ||||
| blocks on one of four wait queues in the ``->exp_wq[]`` array, using the | ||||
| second-from-bottom and third-from-bottom bits as an index. An | ||||
| ``->exp_lock`` field in the ``rcu_node`` structure synchronizes access | ||||
| to these fields. | ||||
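
The wait-queue index computation described above (second- and
third-from-bottom bits of the desired sequence number) reduces to a
one-line helper. The sketch below is illustrative rather than the
kernel's code; its first value matches the walkthrough that follows::

    #include <stdio.h>

    /* The bottom bit of the desired sequence number only says whether a
     * grace period is in progress, so the second- and third-from-bottom
     * bits select one of the four wait queues. */
    static unsigned int exp_wq_index(unsigned long s)
    {
        return (s >> 1) & 0x3;
    }

    int main(void)
    {
        /* Tasks A-D in the walkthrough below want sequence number 2... */
        printf("s=2 -> ->exp_wq[%u]\n", exp_wq_index(2));  /* index 1 */
        /* ...and Tasks E and F want sequence number 4. */
        printf("s=4 -> ->exp_wq[%u]\n", exp_wq_index(4));  /* index 2 */
        return 0;
    }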
| 
 | ||||
| An empty ``rcu_node`` tree is shown in the following diagram, with the | ||||
| white cells representing the ``->exp_seq_rq`` field and the red cells | ||||
| representing the elements of the ``->exp_wq[]`` array. | ||||
| 
 | ||||
| .. kernel-figure:: Funnel0.svg | ||||
| 
 | ||||
| The next diagram shows the situation after the arrival of Task A and | ||||
| Task B at the leftmost and rightmost leaf ``rcu_node`` structures, | ||||
| respectively. The current value of the ``rcu_state`` structure's | ||||
| ``->expedited_sequence`` field is zero, so adding three and clearing the | ||||
| bottom bit results in the value two, which both tasks record in the | ||||
| ``->exp_seq_rq`` field of their respective ``rcu_node`` structures: | ||||
| 
 | ||||
| .. kernel-figure:: Funnel1.svg | ||||
| 
 | ||||
| Each of Tasks A and B will move up to the root ``rcu_node`` structure. | ||||
| Suppose that Task A wins, recording its desired grace-period sequence | ||||
| number and resulting in the state shown below: | ||||
| 
 | ||||
| .. kernel-figure:: Funnel2.svg | ||||
| 
 | ||||
| Task A now advances to initiate a new grace period, while Task B moves | ||||
| up to the root ``rcu_node`` structure, and, seeing that its desired | ||||
| sequence number is already recorded, blocks on ``->exp_wq[1]``. | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | Why ``->exp_wq[1]``? Given that the value of these tasks' desired     | | ||||
| sequence number is two, shouldn't they instead block on               | | ||||
| | ``->exp_wq[2]``?                                                      | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | No.                                                                   | | ||||
| | Recall that the bottom bit of the desired sequence number indicates   | | ||||
| | whether or not a grace period is currently in progress. It is         | | ||||
| | therefore necessary to shift the sequence number right one bit        | | ||||
| | position to obtain the number of the grace period. This results in    | | ||||
| | ``->exp_wq[1]``.                                                      | | ||||
| +-----------------------------------------------------------------------+ | ||||
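
In code, the index computation described by this answer might look as
follows (the function name is invented for illustration)::

    /* Pick one of the four ->exp_wq[] entries for desired sequence number s:
     * drop the bottom "grace period in progress" bit, then keep the low two
     * bits of the resulting grace-period number. */
    static int exp_wq_index_sketch(unsigned long s)
    {
            return (int)((s >> 1) & 0x3);
    }

    /* For the tasks above, s == 2, so the index is (2 >> 1) & 0x3 == 1. */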
| 
 | ||||
| If Tasks C and D also arrive at this point, they will compute the same | ||||
| desired grace-period sequence number, and see that both leaf | ||||
| ``rcu_node`` structures already have that value recorded. They will | ||||
| therefore block on their respective ``rcu_node`` structures' | ||||
| ``->exp_wq[1]`` fields, as shown below: | ||||
| 
 | ||||
| .. kernel-figure:: Funnel3.svg | ||||
| 
 | ||||
| Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and | ||||
| initiates the grace period, which increments ``->expedited_sequence``. | ||||
| Therefore, if Tasks E and F arrive, they will compute a desired sequence | ||||
| number of 4 and will record this value as shown below: | ||||
| 
 | ||||
| .. kernel-figure:: Funnel4.svg | ||||
| 
 | ||||
| Tasks E and F will propagate up the ``rcu_node`` combining tree, with | ||||
| Task F blocking on the root ``rcu_node`` structure and Task E waiting for | ||||
| Task A to finish so that it can start the next grace period. The | ||||
| resulting state is as shown below: | ||||
| 
 | ||||
| .. kernel-figure:: Funnel5.svg | ||||
| 
 | ||||
| Once the grace period completes, Task A starts waking up the tasks | ||||
| waiting for this grace period to complete, increments the | ||||
| ``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then | ||||
| releases the ``->exp_mutex``. This results in the following state: | ||||
| 
 | ||||
| .. kernel-figure:: Funnel6.svg | ||||
| 
 | ||||
| Task E can then acquire ``->exp_mutex`` and increment | ||||
| ``->expedited_sequence`` to the value three. If new tasks G and H arrive | ||||
| and move up the combining tree at the same time, the state will be as | ||||
| follows: | ||||
| 
 | ||||
| .. kernel-figure:: Funnel7.svg | ||||
| 
 | ||||
| Note that three of the root ``rcu_node`` structure's waitqueues are now | ||||
| occupied. However, at some point, Task A will wake up the tasks blocked | ||||
| on the ``->exp_wq`` waitqueues, resulting in the following state: | ||||
| 
 | ||||
| .. kernel-figure:: Funnel8.svg | ||||
| 
 | ||||
| Execution will continue with Tasks E and H completing their grace | ||||
| periods and carrying out their wakeups. | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | What happens if Task A takes so long to do its wakeups that Task E's  | | ||||
| | grace period completes?                                               | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | Then Task E will block on the ``->exp_wake_mutex``, which will also   | | ||||
| | prevent it from releasing ``->exp_mutex``, which in turn will prevent | | ||||
| | the next grace period from starting. This last is important in        | | ||||
| | preventing overflow of the ``->exp_wq[]`` array.                      | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| Use of Workqueues | ||||
| ~~~~~~~~~~~~~~~~~ | ||||
| 
 | ||||
| In earlier implementations, the task requesting the expedited grace | ||||
| period also drove it to completion. This straightforward approach had | ||||
| the disadvantage of needing to account for POSIX signals sent to user | ||||
| tasks, so more recent implementations use the Linux kernel's | ||||
| `workqueues <https://www.kernel.org/doc/Documentation/core-api/workqueue.rst>`__. | ||||
| 
 | ||||
| The requesting task still does counter snapshotting and funnel-lock | ||||
| processing, but the task reaching the top of the funnel lock does a | ||||
| ``schedule_work()`` (from ``_synchronize_rcu_expedited()``) so that a | ||||
| workqueue kthread does the actual grace-period processing. Because | ||||
| workqueue kthreads do not accept POSIX signals, grace-period-wait | ||||
| processing need not allow for POSIX signals. In addition, this approach | ||||
| allows wakeups for the previous expedited grace period to be overlapped | ||||
| with processing for the next expedited grace period. Because there are | ||||
| only four sets of waitqueues, it is necessary to ensure that the | ||||
| previous grace period's wakeups complete before the next grace period's | ||||
| wakeups start. This is handled by having the ``->exp_mutex`` guard | ||||
| expedited grace-period processing and the ``->exp_wake_mutex`` guard | ||||
| wakeups. The key point is that the ``->exp_mutex`` is not released until | ||||
| the first wakeup is complete, which means that the ``->exp_wake_mutex`` | ||||
| has already been acquired at that point. This approach ensures that the | ||||
| previous grace period's wakeups can be carried out while the current | ||||
| grace period is in process, but that these wakeups will complete before | ||||
| the next grace period starts. This means that only three waitqueues are | ||||
| required, guaranteeing that the four that are provided are sufficient. | ||||
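
The handoff just described can be sketched in user space as follows, with
pthread mutexes standing in for the kernel's mutexes and with the
grace-period work and the wakeups themselves elided::

    #include <pthread.h>

    static pthread_mutex_t exp_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t exp_wake_mutex = PTHREAD_MUTEX_INITIALIZER;

    void expedited_gp_and_wakeup_sketch(void)
    {
            pthread_mutex_lock(&exp_mutex);         /* One grace period at a time. */
            /* ... drive the expedited grace period to completion ... */
            pthread_mutex_lock(&exp_wake_mutex);    /* Serialize wakeup phases. */
            /* ... carry out the first wakeup for the completed grace period ... */
            pthread_mutex_unlock(&exp_mutex);       /* The next grace period may now start. */
            /* ... carry out the remaining wakeups ... */
            pthread_mutex_unlock(&exp_wake_mutex);  /* Done before the next set of wakeups. */
    }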
| 
 | ||||
| Stall Warnings | ||||
| ~~~~~~~~~~~~~~ | ||||
| 
 | ||||
| Expediting grace periods does nothing to speed things up when RCU | ||||
| readers take too long, and therefore expedited grace periods check for | ||||
| stalls just as normal grace periods do. | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | But why not just let the normal grace-period machinery detect the     | | ||||
| | stalls, given that a given reader must block both normal and          | | ||||
| | expedited grace periods?                                              | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | Because it is quite possible that at a given time there is no normal  | | ||||
| | grace period in progress, in which case the normal grace period       | | ||||
| | cannot emit a stall warning.                                          | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| The ``synchronize_sched_expedited_wait()`` function loops waiting for | ||||
| the expedited grace period to end, but with a timeout set to the current | ||||
| RCU CPU stall-warning time. If this time is exceeded, any CPUs or | ||||
| ``rcu_node`` structures blocking the current grace period are printed. | ||||
| Each stall warning results in another pass through the loop, but the | ||||
| second and subsequent passes use longer stall times. | ||||
| 
 | ||||
| Mid-boot operation | ||||
| ~~~~~~~~~~~~~~~~~~ | ||||
| 
 | ||||
| The use of workqueues has the advantage that the expedited grace-period | ||||
| code need not worry about POSIX signals. Unfortunately, it has the | ||||
| corresponding disadvantage that workqueues cannot be used until they are | ||||
| initialized, which does not happen until some time after the scheduler | ||||
| spawns the first task. Given that there are parts of the kernel that | ||||
| really do want to execute grace periods during this mid-boot “dead | ||||
| zone”, expedited grace periods must do something else during this time. | ||||
| 
 | ||||
| What they do is to fall back to the old practice of requiring that the | ||||
| requesting task drive the expedited grace period, as was the case before | ||||
| the use of workqueues. However, the requesting task is only required to | ||||
| drive the grace period during the mid-boot dead zone. Before mid-boot, a | ||||
| synchronous grace period is a no-op. Some time after mid-boot, | ||||
| workqueues are used. | ||||
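
A hypothetical dispatch capturing these three regimes is sketched below;
the phase names and helper functions are invented for illustration and are
not the kernel's actual symbols::

    enum boot_phase { EARLY_BOOT, MIDBOOT_DEAD_ZONE, FULLY_BOOTED };

    extern enum boot_phase current_boot_phase;
    extern void drive_expedited_gp(void);        /* Requester drives the grace period. */
    extern void queue_gp_work_and_wait(void);    /* Hand off to a workqueue kthread. */

    void synchronize_rcu_expedited_sketch(void)
    {
            switch (current_boot_phase) {
            case EARLY_BOOT:
                    return;                      /* Only one task: nothing to wait for. */
            case MIDBOOT_DEAD_ZONE:
                    drive_expedited_gp();        /* Workqueues not yet available. */
                    return;
            case FULLY_BOOTED:
            default:
                    queue_gp_work_and_wait();    /* Normal operation. */
                    return;
            }
    }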
| 
 | ||||
| Non-expedited non-SRCU synchronous grace periods must also operate | ||||
| normally during mid-boot. This is handled by causing non-expedited grace | ||||
| periods to take the expedited code path during mid-boot. | ||||
| 
 | ||||
| The current code assumes that there are no POSIX signals during the | ||||
| mid-boot dead zone. However, if an overwhelming need for POSIX signals | ||||
| somehow arises, appropriate adjustments can be made to the expedited | ||||
| stall-warning code. One such adjustment would reinstate the | ||||
| pre-workqueue stall-warning checks, but only during the mid-boot dead | ||||
| zone. | ||||
| 
 | ||||
| With this refinement, synchronous grace periods can now be used from | ||||
| task context pretty much any time during the life of the kernel. That | ||||
| is, aside from some points in the suspend, hibernate, or shutdown code | ||||
| path. | ||||
| 
 | ||||
| Summary | ||||
| ~~~~~~~ | ||||
| 
 | ||||
| Expedited grace periods use a sequence-number approach to promote | ||||
| batching, so that a single grace-period operation can serve numerous | ||||
| requests. A funnel lock is used to efficiently identify the one task out | ||||
| of a concurrent group that will request the grace period. All members of | ||||
| the group will block on waitqueues provided in the ``rcu_node`` | ||||
| structure. The actual grace-period processing is carried out by a | ||||
| workqueue. | ||||
| 
 | ||||
| CPU-hotplug operations are noted lazily in order to prevent the need for | ||||
| tight synchronization between expedited grace periods and CPU-hotplug | ||||
| operations. The dyntick-idle counters are used to avoid sending IPIs to | ||||
| idle CPUs, at least in the common case. RCU-preempt and RCU-sched use | ||||
| different IPI handlers and different code to respond to the state | ||||
| changes carried out by those handlers, but otherwise use common code. | ||||
| 
 | ||||
| Quiescent states are tracked using the ``rcu_node`` tree, and once all | ||||
| necessary quiescent states have been reported, all tasks waiting on this | ||||
| expedited grace period are awakened. A pair of mutexes are used to allow | ||||
| one grace period's wakeups to proceed concurrently with the next grace | ||||
| period's processing. | ||||
| 
 | ||||
| This combination of mechanisms allows expedited grace periods to run | ||||
| reasonably efficiently. However, for non-time-critical tasks, normal | ||||
| grace periods should be used instead because their longer duration | ||||
| permits much higher degrees of batching, and thus much lower per-request | ||||
| overheads. | ||||
|  | @ -1,9 +0,0 @@ | |||
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" | ||||
|         "http://www.w3.org/TR/html4/loose.dtd"> | ||||
|         <html> | ||||
|         <head><title>A Diagram of TREE_RCU's Grace-Period Memory Ordering</title> | ||||
|         <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> | ||||
| 
 | ||||
| <p><img src="TreeRCU-gp.svg" alt="TreeRCU-gp.svg"> | ||||
| 
 | ||||
| </body></html> | ||||
|  | @ -1,704 +0,0 @@ | |||
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" | ||||
|         "http://www.w3.org/TR/html4/loose.dtd"> | ||||
|         <html> | ||||
|         <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title> | ||||
|         <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> | ||||
| 
 | ||||
|            <p>August 8, 2017</p> | ||||
|            <p>This article was contributed by Paul E. McKenney</p> | ||||
| 
 | ||||
| <h3>Introduction</h3> | ||||
| 
 | ||||
| <p>This document gives a rough visual overview of how Tree RCU's | ||||
| grace-period memory ordering guarantee is provided. | ||||
| 
 | ||||
| <ol> | ||||
| <li>	<a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> | ||||
| 	What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a> | ||||
| <li>	<a href="#Tree RCU Grace Period Memory Ordering Building Blocks"> | ||||
| 	Tree RCU Grace Period Memory Ordering Building Blocks</a> | ||||
| <li>	<a href="#Tree RCU Grace Period Memory Ordering Components"> | ||||
| 	Tree RCU Grace Period Memory Ordering Components</a> | ||||
| <li>	<a href="#Putting It All Together">Putting It All Together</a> | ||||
| </ol> | ||||
| 
 | ||||
| <h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> | ||||
| What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3> | ||||
| 
 | ||||
| <p>RCU grace periods provide extremely strong memory-ordering guarantees | ||||
| for non-idle non-offline code. | ||||
| Any code that happens after the end of a given RCU grace period is guaranteed | ||||
| to see the effects of all accesses prior to the beginning of that grace | ||||
| period that are within RCU read-side critical sections. | ||||
| Similarly, any code that happens before the beginning of a given RCU grace | ||||
| period is guaranteed to see the effects of all accesses following the end | ||||
| of that grace period that are within RCU read-side critical sections. | ||||
| 
 | ||||
| <p>Note well that RCU-sched read-side critical sections include any region | ||||
| of code for which preemption is disabled. | ||||
| Given that each individual machine instruction can be thought of as | ||||
| an extremely small region of preemption-disabled code, one can think of | ||||
| <tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids. | ||||
| 
 | ||||
| <p>RCU updaters use this guarantee by splitting their updates into | ||||
| two phases, one of which is executed before the grace period and | ||||
| the other of which is executed after the grace period. | ||||
| In the most common use case, phase one removes an element from | ||||
| a linked RCU-protected data structure, and phase two frees that element. | ||||
| For this to work, any readers that have witnessed state prior to the | ||||
| phase-one update (in the common case, removal) must not witness state | ||||
| following the phase-two update (in the common case, freeing). | ||||
| 
 | ||||
| <p>The RCU implementation provides this guarantee using a network | ||||
| of lock-based critical sections, memory barriers, and per-CPU | ||||
| processing, as is described in the following sections. | ||||
| 
 | ||||
| <h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks"> | ||||
| Tree RCU Grace Period Memory Ordering Building Blocks</a></h3> | ||||
| 
 | ||||
| <p>The workhorse for RCU's grace-period memory ordering is the | ||||
| critical section for the <tt>rcu_node</tt> structure's | ||||
| <tt>->lock</tt>. | ||||
| These critical sections use helper functions for lock acquisition, including | ||||
| <tt>raw_spin_lock_rcu_node()</tt>, | ||||
| <tt>raw_spin_lock_irq_rcu_node()</tt>, and | ||||
| <tt>raw_spin_lock_irqsave_rcu_node()</tt>. | ||||
| Their lock-release counterparts are | ||||
| <tt>raw_spin_unlock_rcu_node()</tt>, | ||||
| <tt>raw_spin_unlock_irq_rcu_node()</tt>, and | ||||
| <tt>raw_spin_unlock_irqrestore_rcu_node()</tt>, | ||||
| respectively. | ||||
| For completeness, a | ||||
| <tt>raw_spin_trylock_rcu_node()</tt> | ||||
| is also provided. | ||||
| The key point is that the lock-acquisition functions, including | ||||
| <tt>raw_spin_trylock_rcu_node()</tt>, all invoke | ||||
| <tt>smp_mb__after_unlock_lock()</tt> immediately after successful | ||||
| acquisition of the lock. | ||||
| 
 | ||||
| <p>Therefore, for any given <tt>rcu_node</tt> structure, any access | ||||
| happening before one of the above lock-release functions will be seen | ||||
| by all CPUs as happening before any access happening after a later | ||||
| one of the above lock-acquisition functions. | ||||
| Furthermore, any access happening before one of the | ||||
| above lock-release function on any given CPU will be seen by all | ||||
| CPUs as happening before any access happening after a later one | ||||
| of the above lock-acquisition functions executing on that same CPU, | ||||
| even if the lock-release and lock-acquisition functions are operating | ||||
| on different <tt>rcu_node</tt> structures. | ||||
| Tree RCU uses these two ordering guarantees to form an ordering | ||||
| network among all CPUs that were in any way involved in the grace | ||||
| period, including any CPUs that came online or went offline during | ||||
| the grace period in question. | ||||
| 
 | ||||
| <p>The following litmus test exhibits the ordering effects of these | ||||
| lock-acquisition and lock-release functions: | ||||
| 
 | ||||
| <pre> | ||||
|  1 int x, y, z; | ||||
|  2 | ||||
|  3 void task0(void) | ||||
|  4 { | ||||
|  5   raw_spin_lock_rcu_node(rnp); | ||||
|  6   WRITE_ONCE(x, 1); | ||||
|  7   r1 = READ_ONCE(y); | ||||
|  8   raw_spin_unlock_rcu_node(rnp); | ||||
|  9 } | ||||
| 10 | ||||
| 11 void task1(void) | ||||
| 12 { | ||||
| 13   raw_spin_lock_rcu_node(rnp); | ||||
| 14   WRITE_ONCE(y, 1); | ||||
| 15   r2 = READ_ONCE(z); | ||||
| 16   raw_spin_unlock_rcu_node(rnp); | ||||
| 17 } | ||||
| 18 | ||||
| 19 void task2(void) | ||||
| 20 { | ||||
| 21   WRITE_ONCE(z, 1); | ||||
| 22   smp_mb(); | ||||
| 23   r3 = READ_ONCE(x); | ||||
| 24 } | ||||
| 25 | ||||
| 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); | ||||
| </pre> | ||||
| 
 | ||||
| <p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”, | ||||
| after all changes have propagated throughout the system. | ||||
| Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the | ||||
| acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example | ||||
| on PowerPC. | ||||
| The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this | ||||
| <tt>WARN_ON()</tt> from triggering. | ||||
| 
 | ||||
| <p>This approach must be extended to include idle CPUs, which need | ||||
| RCU's grace-period memory ordering guarantee to extend to any | ||||
| RCU read-side critical sections preceding and following the current | ||||
| idle sojourn. | ||||
| This case is handled by calls to the strongly ordered | ||||
| <tt>atomic_add_return()</tt> read-modify-write atomic operation that | ||||
| is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry | ||||
| time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time. | ||||
| The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and | ||||
| <tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke | ||||
| an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	But what about CPUs that remain offline for the entire | ||||
| 	grace period? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	Such CPUs will be offline at the beginning of the grace period, | ||||
| 	so the grace period won't expect quiescent states from them. | ||||
| 	Races between grace-period start and CPU-hotplug operations | ||||
| 	are mediated by the CPU's leaf <tt>rcu_node</tt> structure's | ||||
| 	<tt>->lock</tt> as described above. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <p>The approach must be extended to handle one final case, that | ||||
| of waking a task blocked in <tt>synchronize_rcu()</tt>. | ||||
| This task might be affinitied to a CPU that is not yet aware that | ||||
| the grace period has ended, and thus might not yet be subject to | ||||
| the grace period's memory ordering. | ||||
| Therefore, there is an <tt>smp_mb()</tt> after the return from | ||||
| <tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt> | ||||
| code path. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	What?  Where??? | ||||
| 	I don't see any <tt>smp_mb()</tt> after the return from | ||||
| 	<tt>wait_for_completion()</tt>!!! | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	That would be because I spotted the need for that | ||||
| 	<tt>smp_mb()</tt> during the creation of this documentation, | ||||
| 	and it is therefore unlikely to hit mainline before v4.14. | ||||
| 	Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and | ||||
| 	Jonathan Cameron for asking questions that sensitized me | ||||
| 	to the rather elaborate sequence of events that demonstrate | ||||
| 	the need for this memory barrier. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <p>Tree RCU's grace-period memory-ordering guarantees rely most | ||||
| heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt> | ||||
| field, so much so that it is necessary to abbreviate this pattern | ||||
| in the diagrams in the next section. | ||||
| For example, consider the <tt>rcu_prepare_for_idle()</tt> function | ||||
| shown below, which is one of several functions that enforce ordering | ||||
| of newly arrived RCU callbacks against future grace periods: | ||||
| 
 | ||||
| <pre> | ||||
|  1 static void rcu_prepare_for_idle(void) | ||||
|  2 { | ||||
|  3   bool needwake; | ||||
|  4   struct rcu_data *rdp; | ||||
|  5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); | ||||
|  6   struct rcu_node *rnp; | ||||
|  7   struct rcu_state *rsp; | ||||
|  8   int tne; | ||||
|  9 | ||||
| 10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || | ||||
| 11       rcu_is_nocb_cpu(smp_processor_id())) | ||||
| 12     return; | ||||
| 13   tne = READ_ONCE(tick_nohz_active); | ||||
| 14   if (tne != rdtp->tick_nohz_enabled_snap) { | ||||
| 15     if (rcu_cpu_has_callbacks(NULL)) | ||||
| 16       invoke_rcu_core(); | ||||
| 17     rdtp->tick_nohz_enabled_snap = tne; | ||||
| 18     return; | ||||
| 19   } | ||||
| 20   if (!tne) | ||||
| 21     return; | ||||
| 22   if (rdtp->all_lazy && | ||||
| 23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { | ||||
| 24     rdtp->all_lazy = false; | ||||
| 25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; | ||||
| 26     invoke_rcu_core(); | ||||
| 27     return; | ||||
| 28   } | ||||
| 29   if (rdtp->last_accelerate == jiffies) | ||||
| 30     return; | ||||
| 31   rdtp->last_accelerate = jiffies; | ||||
| 32   for_each_rcu_flavor(rsp) { | ||||
| 33     rdp = this_cpu_ptr(rsp->rda); | ||||
| 34     if (rcu_segcblist_pend_cbs(&rdp->cblist)) | ||||
| 35       continue; | ||||
| 36     rnp = rdp->mynode; | ||||
| 37     raw_spin_lock_rcu_node(rnp); | ||||
| 38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp); | ||||
| 39     raw_spin_unlock_rcu_node(rnp); | ||||
| 40     if (needwake) | ||||
| 41       rcu_gp_kthread_wake(rsp); | ||||
| 42   } | ||||
| 43 } | ||||
| </pre> | ||||
| 
 | ||||
| <p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really | ||||
| matters for this discussion is lines 37–39. | ||||
| We will therefore abbreviate this function as follows: | ||||
| 
 | ||||
| </p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg"> | ||||
| 
 | ||||
| <p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt> | ||||
| critical section, with the double line on top representing the additional | ||||
| <tt>smp_mb__after_unlock_lock()</tt>. | ||||
| 
 | ||||
| <h3><a name="Tree RCU Grace Period Memory Ordering Components"> | ||||
| Tree RCU Grace Period Memory Ordering Components</a></h3> | ||||
| 
 | ||||
| <p>Tree RCU's grace-period memory-ordering guarantee is provided by | ||||
| a number of RCU components: | ||||
| 
 | ||||
| <ol> | ||||
| <li>	<a href="#Callback Registry">Callback Registry</a> | ||||
| <li>	<a href="#Grace-Period Initialization">Grace-Period Initialization</a> | ||||
| <li>	<a href="#Self-Reported Quiescent States"> | ||||
| 	Self-Reported Quiescent States</a> | ||||
| <li>	<a href="#Dynamic Tick Interface">Dynamic Tick Interface</a> | ||||
| <li>	<a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a> | ||||
| <li>	<a href="Forcing Quiescent States">Forcing Quiescent States</a> | ||||
| <li>	<a href="Grace-Period Cleanup">Grace-Period Cleanup</a> | ||||
| <li>	<a href="Callback Invocation">Callback Invocation</a> | ||||
| </ol> | ||||
| 
 | ||||
| <p>Each of the following sections looks at the corresponding component | ||||
| in detail. | ||||
| 
 | ||||
| <h4><a name="Callback Registry">Callback Registry</a></h4> | ||||
| 
 | ||||
| <p>If RCU's grace-period guarantee is to mean anything at all, any | ||||
| access that happens before a given invocation of <tt>call_rcu()</tt> | ||||
| must also happen before the corresponding grace period. | ||||
| The implementation of this portion of RCU's grace period guarantee | ||||
| is shown in the following figure: | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg"> | ||||
| 
 | ||||
| <p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state, | ||||
| it provides no ordering guarantees, either for itself or for | ||||
| phase one of the update (which again will usually be removal of | ||||
| an element from an RCU-protected data structure). | ||||
| It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list, | ||||
| which cannot become associated with a grace period until a later | ||||
| call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above. | ||||
| 
 | ||||
| <p>One set of code paths shown on the left invokes | ||||
| <tt>rcu_accelerate_cbs()</tt> via | ||||
| <tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if | ||||
| the current CPU is inundated with queued <tt>rcu_head</tt> structures) | ||||
| or more likely from an <tt>RCU_SOFTIRQ</tt> handler. | ||||
| Another code path in the middle is taken only in kernels built with | ||||
| <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes | ||||
| <tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>. | ||||
| The final code path on the right is taken only in kernels built with | ||||
| <tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes | ||||
| <tt>rcu_accelerate_cbs()</tt> via | ||||
| <tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks</tt>, | ||||
| <tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>, | ||||
| which in turn is invoked on a surviving CPU after the outgoing | ||||
| CPU has been completely offlined. | ||||
| 
 | ||||
| <p>There are a few other code paths within grace-period processing | ||||
| that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>. | ||||
| However, either way, all of the CPU's recently queued <tt>rcu_head</tt> | ||||
| structures are associated with a future grace-period number under | ||||
| the protection of the CPU's lead <tt>rcu_node</tt> structure's | ||||
| <tt>->lock</tt>. | ||||
| In all cases, there is full ordering against any prior critical section | ||||
| for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and | ||||
| also full ordering against any of the current task's or CPU's prior critical | ||||
| sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>. | ||||
| 
 | ||||
| <p>The next section will show how this ordering ensures that any | ||||
| accesses prior to the <tt>call_rcu()</tt> (particularly including phase | ||||
| one of the update) | ||||
| happen before the start of the corresponding grace period. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	But what about <tt>synchronize_rcu()</tt>? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt> | ||||
| 	to <tt>wait_rcu_gp()</tt>, which invokes it. | ||||
| 	So either way, it eventually comes down to <tt>call_rcu()</tt>. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4> | ||||
| 
 | ||||
| <p>Grace-period initialization is carried out by | ||||
| the grace-period kernel thread, which makes several passes over the | ||||
| <tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function. | ||||
| This means that showing the full flow of ordering through the | ||||
| grace-period computation will require duplicating this tree. | ||||
| If you find this confusing, please note that the state of the | ||||
| <tt>rcu_node</tt> changes over time, just like Heraclitus's river. | ||||
| However, to keep the <tt>rcu_node</tt> river tractable, the | ||||
| grace-period kernel thread's traversals are presented in multiple | ||||
| parts, starting in this section with the various phases of | ||||
| grace-period initialization. | ||||
| 
 | ||||
| <p>The first ordering-related grace-period initialization action is to | ||||
| advance the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> | ||||
| grace-period-number counter, as shown below: | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> | ||||
| 
 | ||||
| <p>The actual increment is carried out using <tt>smp_store_release()</tt>, | ||||
| which helps reject false-positive RCU CPU stall detection. | ||||
| Note that only the root <tt>rcu_node</tt> structure is touched. | ||||
| 
 | ||||
| <p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks | ||||
| based on CPUs having come online or gone offline since the start of | ||||
| the previous grace period. | ||||
| In the common case where the number of online CPUs for this <tt>rcu_node</tt> | ||||
| structure has not transitioned to or from zero, | ||||
| this pass will scan only the leaf <tt>rcu_node</tt> structures. | ||||
| However, if the number of online CPUs for a given leaf <tt>rcu_node</tt> | ||||
| structure has transitioned from zero, | ||||
| <tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU. | ||||
| Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt> | ||||
| structure has transitioned to zero, | ||||
| <tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU. | ||||
| The diagram below shows the path of ordering if the leftmost | ||||
| <tt>rcu_node</tt> structure onlines its first CPU and if the next | ||||
| <tt>rcu_node</tt> structure has no online CPUs | ||||
| (or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines | ||||
| its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs). | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> | ||||
| 
 | ||||
| <p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt> | ||||
| tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's | ||||
| <tt>->gp_seq</tt> field to the newly advanced value from the | ||||
| <tt>rcu_state</tt> structure, as shown in the following diagram. | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> | ||||
| 
 | ||||
| <p>This change will also cause each CPU's next call to | ||||
| <tt>__note_gp_changes()</tt> | ||||
| to notice that a new grace period has started, as described in the next | ||||
| section. | ||||
| But because the grace-period kthread started the grace period at the | ||||
| root (with the advancing of the <tt>rcu_state</tt> structure's | ||||
| <tt>->gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt> | ||||
| structure's <tt>->gp_seq</tt> field, each CPU's observation of | ||||
| the start of the grace period will happen after the actual start | ||||
| of the grace period. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	But what about the CPU that started the grace period? | ||||
| 	Why wouldn't it see the start of the grace period right when | ||||
| 	it started that grace period? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	In some deep philosophical and overly anthropomorphized | ||||
| 	sense, yes, the CPU starting the grace period is immediately | ||||
| 	aware of having done so. | ||||
| 	However, if we instead assume that RCU is not self-aware, | ||||
| 	then even the CPU starting the grace period does not really | ||||
| 	become aware of the start of this grace period until its | ||||
| 	first call to <tt>__note_gp_changes()</tt>. | ||||
| 	On the other hand, this CPU potentially gets early notification | ||||
| 	because it invokes <tt>__note_gp_changes()</tt> during its | ||||
| 	last <tt>rcu_gp_init()</tt> pass through its leaf | ||||
| 	<tt>rcu_node</tt> structure. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <h4><a name="Self-Reported Quiescent States"> | ||||
| Self-Reported Quiescent States</a></h4> | ||||
| 
 | ||||
| <p>When all entities that might block the grace period have reported | ||||
| quiescent states (or as described in a later section, had quiescent | ||||
| states reported on their behalf), the grace period can end. | ||||
| Online non-idle CPUs report their own quiescent states, as shown | ||||
| in the following diagram: | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%"> | ||||
| 
 | ||||
| <p>This is for the last CPU to report a quiescent state, which signals | ||||
| the end of the grace period. | ||||
| Earlier quiescent states would push up the <tt>rcu_node</tt> tree | ||||
| only until they encountered an <tt>rcu_node</tt> structure that | ||||
| is waiting for additional quiescent states. | ||||
| However, ordering is nevertheless preserved because some later quiescent | ||||
| state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>. | ||||
| 
 | ||||
| <p>Any number of events can lead up to a CPU invoking | ||||
| <tt>note_gp_changes</tt> (or alternatively, directly invoking | ||||
| <tt>__note_gp_changes()</tt>), at which point that CPU will notice | ||||
| the start of a new grace period while holding its leaf | ||||
| <tt>rcu_node</tt> lock. | ||||
| Therefore, all execution shown in this diagram happens after the | ||||
| start of the grace period. | ||||
| In addition, this CPU will consider any RCU read-side critical | ||||
| section that started before the invocation of <tt>__note_gp_changes()</tt> | ||||
| to have started before the grace period, and thus a critical | ||||
| section that the grace period must wait on. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	But a RCU read-side critical section might have started | ||||
| 	after the beginning of the grace period | ||||
| 	(the advancing of <tt>->gp_seq</tt> from earlier), so why should | ||||
| 	the grace period wait on such a critical section? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	It is indeed not necessary for the grace period to wait on such | ||||
| 	a critical section. | ||||
| 	However, it is permissible to wait on it. | ||||
| 	And it is furthermore important to wait on it, as this | ||||
| 	lazy approach is far more scalable than a “big bang” | ||||
| 	all-at-once grace-period start could possibly be. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <p>If the CPU does a context switch, a quiescent state will be | ||||
| noted by <tt>rcu_note_context_switch()</tt> on the left. | ||||
| On the other hand, if the CPU takes a scheduler-clock interrupt | ||||
| while executing in usermode, a quiescent state will be noted by | ||||
| <tt>rcu_sched_clock_irq()</tt> on the right. | ||||
| Either way, the passage through a quiescent state will be noted | ||||
| in a per-CPU variable. | ||||
| 
 | ||||
| <p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on | ||||
| this CPU (for example, after the next scheduler-clock | ||||
| interrupt), <tt>rcu_core()</tt> will invoke | ||||
| <tt>rcu_check_quiescent_state()</tt>, which will notice the | ||||
| recorded quiescent state, and invoke | ||||
| <tt>rcu_report_qs_rdp()</tt>. | ||||
| If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state | ||||
| really does apply to the current grace period, it invokes | ||||
| <tt>rcu_report_rnp()</tt> which traverses up the <tt>rcu_node</tt> | ||||
| tree as shown at the bottom of the diagram, clearing bits from | ||||
| each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field, | ||||
| and propagating up the tree when the result is zero. | ||||
| 
 | ||||
| <p>Note that traversal passes upwards out of a given <tt>rcu_node</tt> | ||||
| structure only if the current CPU is reporting the last quiescent | ||||
| state for the subtree headed by that <tt>rcu_node</tt> structure. | ||||
| A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt> | ||||
| structure, then there will be a later traversal by another CPU | ||||
| (or perhaps the same one) that proceeds upwards | ||||
| from that point, and the <tt>rcu_node</tt> <tt>->lock</tt> | ||||
| guarantees that the first CPU's quiescent state happens before the | ||||
| remainder of the second CPU's traversal. | ||||
| Applying this line of thought repeatedly shows that all CPUs' | ||||
| quiescent states happen before the last CPU traverses through | ||||
| the root <tt>rcu_node</tt> structure, the “last CPU” | ||||
| being the one that clears the last bit in the root <tt>rcu_node</tt> | ||||
| structure's <tt>->qsmask</tt> field. | ||||
| 
 | ||||
| <h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4> | ||||
| 
 | ||||
| <p>Due to energy-efficiency considerations, RCU is forbidden from | ||||
| disturbing idle CPUs. | ||||
| CPUs are therefore required to notify RCU when entering or leaving idle | ||||
| state, which they do via fully ordered value-returning atomic operations | ||||
| on a per-CPU variable. | ||||
| The ordering effects are as shown below: | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%"> | ||||
| 
 | ||||
| <p>The RCU grace-period kernel thread samples the per-CPU idleness | ||||
| variable while holding the corresponding CPU's leaf <tt>rcu_node</tt> | ||||
| structure's <tt>->lock</tt>. | ||||
| This means that any RCU read-side critical sections that precede the | ||||
| idle period (the oval near the top of the diagram above) will happen | ||||
| before the end of the current grace period. | ||||
| Similarly, the beginning of the current grace period will happen before | ||||
| any RCU read-side critical sections that follow the | ||||
| idle period (the oval near the bottom of the diagram above). | ||||
| 
 | ||||
| <p>Plumbing this into the full grace-period execution is described | ||||
| <a href="#Forcing Quiescent States">below</a>. | ||||
| 
 | ||||
| <h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4> | ||||
| 
 | ||||
| <p>RCU is also forbidden from disturbing offline CPUs, which might well | ||||
| be powered off and removed from the system completely. | ||||
| CPUs are therefore required to notify RCU of their comings and goings | ||||
| as part of the corresponding CPU hotplug operations. | ||||
| The ordering effects are shown below: | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%"> | ||||
| 
 | ||||
| <p>Because CPU hotplug operations are much less frequent than idle transitions, | ||||
| they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt> | ||||
| structure's <tt>->lock</tt> and update this structure's | ||||
| <tt>->qsmaskinitnext</tt>. | ||||
| The RCU grace-period kernel thread samples this mask to detect CPUs | ||||
| having gone offline since the beginning of this grace period. | ||||
| 
 | ||||
| <p>Plumbing this into the full grace-period execution is described | ||||
| <a href="#Forcing Quiescent States">below</a>. | ||||
| 
 | ||||
| <h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4> | ||||
| 
 | ||||
| <p>As noted above, idle and offline CPUs cannot report their own | ||||
| quiescent states, and therefore the grace-period kernel thread | ||||
| must do the reporting on their behalf. | ||||
| This process is called “forcing quiescent states”, it is | ||||
| repeated every few jiffies, and its ordering effects are shown below: | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%"> | ||||
| 
 | ||||
| <p>Each pass of quiescent state forcing is guaranteed to traverse the | ||||
| leaf <tt>rcu_node</tt> structures, and if there are no new quiescent | ||||
| states due to recently idled and/or offlined CPUs, then only the | ||||
| leaves are traversed. | ||||
| However, if there is a newly offlined CPU as illustrated on the left | ||||
| or a newly idled CPU as illustrated on the right, the corresponding | ||||
| quiescent state will be driven up towards the root. | ||||
| As with self-reported quiescent states, the upwards driving stops | ||||
| once it reaches an <tt>rcu_node</tt> structure that has quiescent | ||||
| states outstanding from other CPUs. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	The leftmost drive to root stopped before it reached | ||||
| 	the root <tt>rcu_node</tt> structure, which means that | ||||
| 	there are still CPUs subordinate to that structure on | ||||
| 	which the current grace period is waiting. | ||||
| 	Given that, how is it possible that the rightmost drive | ||||
| 	to root ended the grace period? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	Good analysis! | ||||
| 	It is in fact impossible in the absence of bugs in RCU. | ||||
| 	But this diagram is complex enough as it is, so simplicity | ||||
| 	overrode accuracy. | ||||
| 	You can think of it as poetic license, or you can think of | ||||
| 	it as misdirection that is resolved in the | ||||
| 	<a href="#Putting It All Together">stitched-together diagram</a>. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| <h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4> | ||||
| 
 | ||||
| <p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree | ||||
| breadth-first advancing all the <tt>->gp_seq</tt> fields, then it | ||||
| advances the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> field. | ||||
| The ordering effects are shown below: | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%"> | ||||
| 
 | ||||
| <p>As indicated by the oval at the bottom of the diagram, once | ||||
| grace-period cleanup is complete, the next grace period can begin. | ||||
| 
 | ||||
| <table> | ||||
| <tr><th> </th></tr> | ||||
| <tr><th align="left">Quick Quiz:</th></tr> | ||||
| <tr><td> | ||||
| 	But when precisely does the grace period end? | ||||
| </td></tr> | ||||
| <tr><th align="left">Answer:</th></tr> | ||||
| <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||||
| 	There is no useful single point at which the grace period | ||||
| 	can be said to end. | ||||
| 	The earliest reasonable candidate is as soon as the last | ||||
| 	CPU has reported its quiescent state, but it may be some | ||||
| 	milliseconds before RCU becomes aware of this. | ||||
| 	The latest reasonable candidate is once the <tt>rcu_state</tt> | ||||
| 	structure's <tt>->gp_seq</tt> field has been updated, | ||||
| 	but it is quite possible that some CPUs have already completed | ||||
| 	phase two of their updates by that time. | ||||
| 	In short, if you are going to work with RCU, you need to | ||||
| 	learn to embrace uncertainty. | ||||
| </font></td></tr> | ||||
| <tr><td> </td></tr> | ||||
| </table> | ||||
| 
 | ||||
| 
 | ||||
| <h4><a name="Callback Invocation">Callback Invocation</a></h4> | ||||
| 
 | ||||
| <p>Once a given CPU's leaf <tt>rcu_node</tt> structure's | ||||
| <tt>->gp_seq</tt> field has been updated, that CPU can begin | ||||
| invoking its RCU callbacks that were waiting for this grace period | ||||
| to end. | ||||
| These callbacks are identified by <tt>rcu_advance_cbs()</tt>, | ||||
| which is usually invoked by <tt>__note_gp_changes()</tt>. | ||||
| As shown in the diagram below, this invocation can be triggered by | ||||
| the scheduling-clock interrupt (<tt>rcu_sched_clock_irq()</tt> on | ||||
| the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on | ||||
| the right, but only for kernels built with | ||||
| <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>). | ||||
| Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in | ||||
| <tt>rcu_do_batch()</tt> invoking the callbacks, which in turn | ||||
| allows those callbacks to carry out (either directly or indirectly | ||||
| via wakeup) the needed phase-two processing for each update. | ||||
| 
 | ||||
| </p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%"> | ||||
| 
 | ||||
| <p>Please note that callback invocation can also be prompted by any | ||||
| number of corner-case code paths, for example, when a CPU notes that | ||||
| it has excessive numbers of callbacks queued. | ||||
| In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's | ||||
| <tt>->lock</tt> before invoking callbacks, which preserves the | ||||
| required ordering against the newly completed grace period. | ||||
| 
 | ||||
| <p>However, if the callback function communicates to other CPUs, | ||||
| for example, doing a wakeup, then it is that function's responsibility | ||||
| to maintain ordering. | ||||
| For example, if the callback function wakes up a task that runs on | ||||
| some other CPU, proper ordering must be in place in both the callback | ||||
| function and the task being awakened. | ||||
| To see why this is important, consider the top half of the | ||||
| <a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram. | ||||
| The callback might be running on a CPU corresponding to the leftmost | ||||
| leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on | ||||
| a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure, | ||||
| and the grace-period kernel thread might not yet have reached the | ||||
| rightmost leaf. | ||||
| In this case, the grace period's memory ordering might not yet have | ||||
| reached that CPU, so again the callback function and the awakened | ||||
| task must supply proper ordering. | ||||
| 
 | ||||
| <h3><a name="Putting It All Together">Putting It All Together</a></h3> | ||||
| 
 | ||||
| <p>A stitched-together diagram is | ||||
| <a href="Tree-RCU-Diagram.html">here</a>. | ||||
| 
 | ||||
| <h3><a name="Legal Statement"> | ||||
| Legal Statement</a></h3> | ||||
| 
 | ||||
| <p>This work represents the view of the author and does not necessarily | ||||
| represent the view of IBM. | ||||
| 
 | ||||
| </p><p>Linux is a registered trademark of Linus Torvalds. | ||||
| 
 | ||||
| </p><p>Other company, product, and service names may be trademarks or | ||||
| service marks of others. | ||||
| 
 | ||||
| </body></html> | ||||
|  | @ -0,0 +1,624 @@ | |||
| ====================================================== | ||||
| A Tour Through TREE_RCU's Grace-Period Memory Ordering | ||||
| ====================================================== | ||||
| 
 | ||||
| August 8, 2017 | ||||
| 
 | ||||
| This article was contributed by Paul E. McKenney | ||||
| 
 | ||||
| Introduction | ||||
| ============ | ||||
| 
 | ||||
| This document gives a rough visual overview of how Tree RCU's | ||||
| grace-period memory ordering guarantee is provided. | ||||
| 
 | ||||
| What Is Tree RCU's Grace Period Memory Ordering Guarantee? | ||||
| ========================================================== | ||||
| 
 | ||||
| RCU grace periods provide extremely strong memory-ordering guarantees | ||||
| for non-idle non-offline code. | ||||
| Any code that happens after the end of a given RCU grace period is guaranteed | ||||
| to see the effects of all accesses prior to the beginning of that grace | ||||
| period that are within RCU read-side critical sections. | ||||
| Similarly, any code that happens before the beginning of a given RCU grace | ||||
| period is guaranteed to see the effects of all accesses following the end | ||||
| of that grace period that are within RCU read-side critical sections. | ||||
| 
 | ||||
| Note well that RCU-sched read-side critical sections include any region | ||||
| of code for which preemption is disabled. | ||||
| Given that each individual machine instruction can be thought of as | ||||
| an extremely small region of preemption-disabled code, one can think of | ||||
| ``synchronize_rcu()`` as ``smp_mb()`` on steroids. | ||||
| 
 | ||||
| RCU updaters use this guarantee by splitting their updates into | ||||
| two phases, one of which is executed before the grace period and | ||||
| the other of which is executed after the grace period. | ||||
| In the most common use case, phase one removes an element from | ||||
| a linked RCU-protected data structure, and phase two frees that element. | ||||
| For this to work, any readers that have witnessed state prior to the | ||||
| phase-one update (in the common case, removal) must not witness state | ||||
| following the phase-two update (in the common case, freeing). | ||||
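
For example, a minimal sketch of this removal-then-free pattern might look
as follows; the ``struct foo`` type and the two helper functions are
invented for this illustration::

    #include <linux/rculist.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
            struct list_head list;
            struct rcu_head rh;
            int data;
    };

    static void foo_reclaim(struct rcu_head *rh)
    {
            kfree(container_of(rh, struct foo, rh));  /* Phase two: free the element. */
    }

    static void foo_remove(struct foo *p)
    {
            list_del_rcu(&p->list);                   /* Phase one: unlink the element. */
            call_rcu(&p->rh, foo_reclaim);            /* Free only after a grace period. */
    }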
| 
 | ||||
| The RCU implementation provides this guarantee using a network | ||||
| of lock-based critical sections, memory barriers, and per-CPU | ||||
| processing, as is described in the following sections. | ||||
| 
 | ||||
| Tree RCU Grace Period Memory Ordering Building Blocks | ||||
| ===================================================== | ||||
| 
 | ||||
| The workhorse for RCU's grace-period memory ordering is the | ||||
| critical section for the ``rcu_node`` structure's | ||||
| ``->lock``. These critical sections use helper functions for lock | ||||
| acquisition, including ``raw_spin_lock_rcu_node()``, | ||||
| ``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``. | ||||
| Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``, | ||||
| ``raw_spin_unlock_irq_rcu_node()``, and | ||||
| ``raw_spin_unlock_irqrestore_rcu_node()``, respectively. | ||||
| For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided. | ||||
| The key point is that the lock-acquisition functions, including | ||||
| ``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()`` | ||||
| immediately after successful acquisition of the lock. | ||||
| 
 | ||||
| Therefore, for any given ``rcu_node`` structure, any access | ||||
| happening before one of the above lock-release functions will be seen | ||||
| by all CPUs as happening before any access happening after a later | ||||
| one of the above lock-acquisition functions. | ||||
| Furthermore, any access happening before one of the | ||||
| above lock-release functions on any given CPU will be seen by all | ||||
| CPUs as happening before any access happening after a later one | ||||
| of the above lock-acquisition functions executing on that same CPU, | ||||
| even if the lock-release and lock-acquisition functions are operating | ||||
| on different ``rcu_node`` structures. | ||||
| Tree RCU uses these two ordering guarantees to form an ordering | ||||
| network among all CPUs that were in any way involved in the grace | ||||
| period, including any CPUs that came online or went offline during | ||||
| the grace period in question. | ||||
| 
 | ||||
| The following litmus test exhibits the ordering effects of these | ||||
| lock-acquisition and lock-release functions:: | ||||
| 
 | ||||
|     1 int x, y, z; | ||||
|     2 | ||||
|     3 void task0(void) | ||||
|     4 { | ||||
|     5   raw_spin_lock_rcu_node(rnp); | ||||
|     6   WRITE_ONCE(x, 1); | ||||
|     7   r1 = READ_ONCE(y); | ||||
|     8   raw_spin_unlock_rcu_node(rnp); | ||||
|     9 } | ||||
|    10 | ||||
|    11 void task1(void) | ||||
|    12 { | ||||
|    13   raw_spin_lock_rcu_node(rnp); | ||||
|    14   WRITE_ONCE(y, 1); | ||||
|    15   r2 = READ_ONCE(z); | ||||
|    16   raw_spin_unlock_rcu_node(rnp); | ||||
|    17 } | ||||
|    18 | ||||
|    19 void task2(void) | ||||
|    20 { | ||||
|    21   WRITE_ONCE(z, 1); | ||||
|    22   smp_mb(); | ||||
|    23   r3 = READ_ONCE(x); | ||||
|    24 } | ||||
|    25 | ||||
|    26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); | ||||
| 
 | ||||
| The ``WARN_ON()`` is evaluated at “the end of time”, | ||||
| after all changes have propagated throughout the system. | ||||
| Without the ``smp_mb__after_unlock_lock()`` provided by the | ||||
| acquisition functions, this ``WARN_ON()`` could trigger, for example | ||||
| on PowerPC. | ||||
| The ``smp_mb__after_unlock_lock()`` invocations prevent this | ||||
| ``WARN_ON()`` from triggering. | ||||
| 
 | ||||
| This approach must be extended to include idle CPUs, which need | ||||
| RCU's grace-period memory ordering guarantee to extend to any | ||||
| RCU read-side critical sections preceding and following the current | ||||
| idle sojourn. | ||||
| This case is handled by calls to the strongly ordered | ||||
| ``atomic_add_return()`` read-modify-write atomic operation that | ||||
| is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry | ||||
| time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time. | ||||
| The grace-period kthread invokes ``rcu_dynticks_snap()`` and | ||||
| ``rcu_dynticks_in_eqs_since()`` (both of which invoke | ||||
| an ``atomic_add_return()`` of zero) to detect idle CPUs. | ||||
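
The following sketch conveys the idea, although it is not the kernel's
actual encoding: a per-CPU counter that is even while the CPU is idle, with
every transition and every sample performed by a fully ordered
``atomic_add_return()``::

    #include <linux/atomic.h>

    static atomic_t dynticks_sketch = ATOMIC_INIT(1);   /* Odd: CPU not idle. */

    static void eqs_enter_sketch(void)                  /* Called at idle entry. */
    {
            atomic_add_return(1, &dynticks_sketch);      /* Now even; full barrier. */
    }

    static void eqs_exit_sketch(void)                   /* Called at idle exit. */
    {
            atomic_add_return(1, &dynticks_sketch);      /* Now odd; full barrier. */
    }

    /* Grace-period kthread side: fully ordered sample of the counter. */
    static int dynticks_snap_sketch(void)
    {
            return atomic_add_return(0, &dynticks_sketch);
    }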
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | But what about CPUs that remain offline for the entire grace period?  | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | Such CPUs will be offline at the beginning of the grace period, so    | | ||||
| | the grace period won't expect quiescent states from them. Races       | | ||||
| | between grace-period start and CPU-hotplug operations are mediated    | | ||||
| | by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described    | | ||||
| | above.                                                                | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| The approach must be extended to handle one final case, that of waking a | ||||
| task blocked in ``synchronize_rcu()``. This task might be affinitied to | ||||
| a CPU that is not yet aware that the grace period has ended, and thus | ||||
| might not yet be subject to the grace period's memory ordering. | ||||
| Therefore, there is an ``smp_mb()`` after the return from | ||||
| ``wait_for_completion()`` in the ``synchronize_rcu()`` code path. | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | What? Where??? I don't see any ``smp_mb()`` after the return from     | | ||||
| | ``wait_for_completion()``!!!                                          | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | That would be because I spotted the need for that ``smp_mb()`` during | | ||||
| | the creation of this documentation, and it is therefore unlikely to   | | ||||
| | hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter     | | ||||
| | Zijlstra, and Jonathan Cameron for asking questions that sensitized   | | ||||
| | me to the rather elaborate sequence of events that demonstrate the    | | ||||
| | need for this memory barrier.                                         | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
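| The intended pattern is nevertheless simple. A sketch of the waiting | ||||
| side, assuming a hypothetical completion that is signaled at the end | ||||
| of the grace period, looks like this: | ||||
| 
| :: | ||||
| 
|     1 static DECLARE_COMPLETION(sketch_gp_done); /* Hypothetical completion. */ | ||||
|     2 | ||||
|     3 static void sketch_wait_for_gp(void) | ||||
|     4 { | ||||
|     5   wait_for_completion(&sketch_gp_done); | ||||
|     6   smp_mb(); /* Order this task's later accesses after the grace period. */ | ||||
|     7 } | ||||
| 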
| Tree RCU's grace-period memory-ordering guarantees rely most heavily on | ||||
| the ``rcu_node`` structure's ``->lock`` field, so much so that it is | ||||
| necessary to abbreviate this pattern in the diagrams in the next | ||||
| section. For example, consider the ``rcu_prepare_for_idle()`` function | ||||
| shown below, which is one of several functions that enforce ordering of | ||||
| newly arrived RCU callbacks against future grace periods: | ||||
| 
 | ||||
| :: | ||||
| 
 | ||||
|     1 static void rcu_prepare_for_idle(void) | ||||
|     2 { | ||||
|     3   bool needwake; | ||||
|     4   struct rcu_data *rdp; | ||||
|     5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); | ||||
|     6   struct rcu_node *rnp; | ||||
|     7   struct rcu_state *rsp; | ||||
|     8   int tne; | ||||
|     9 | ||||
|    10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || | ||||
|    11       rcu_is_nocb_cpu(smp_processor_id())) | ||||
|    12     return; | ||||
|    13   tne = READ_ONCE(tick_nohz_active); | ||||
|    14   if (tne != rdtp->tick_nohz_enabled_snap) { | ||||
|    15     if (rcu_cpu_has_callbacks(NULL)) | ||||
|    16       invoke_rcu_core(); | ||||
|    17     rdtp->tick_nohz_enabled_snap = tne; | ||||
|    18     return; | ||||
|    19   } | ||||
|    20   if (!tne) | ||||
|    21     return; | ||||
|    22   if (rdtp->all_lazy && | ||||
|    23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { | ||||
|    24     rdtp->all_lazy = false; | ||||
|    25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; | ||||
|    26     invoke_rcu_core(); | ||||
|    27     return; | ||||
|    28   } | ||||
|    29   if (rdtp->last_accelerate == jiffies) | ||||
|    30     return; | ||||
|    31   rdtp->last_accelerate = jiffies; | ||||
|    32   for_each_rcu_flavor(rsp) { | ||||
|    33     rdp = this_cpu_ptr(rsp->rda); | ||||
|    34     if (rcu_segcblist_pend_cbs(&rdp->cblist)) | ||||
|    35       continue; | ||||
|    36     rnp = rdp->mynode; | ||||
|    37     raw_spin_lock_rcu_node(rnp); | ||||
|    38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp); | ||||
|    39     raw_spin_unlock_rcu_node(rnp); | ||||
|    40     if (needwake) | ||||
|    41       rcu_gp_kthread_wake(rsp); | ||||
|    42   } | ||||
|    43 } | ||||
| 
 | ||||
| But the only part of ``rcu_prepare_for_idle()`` that really matters for | ||||
| this discussion is lines 37–39. We will therefore abbreviate this | ||||
| function as follows: | ||||
| 
 | ||||
| .. kernel-figure:: rcu_node-lock.svg | ||||
| 
 | ||||
| The box represents the ``rcu_node`` structure's ``->lock`` critical | ||||
| section, with the double line on top representing the additional | ||||
| ``smp_mb__after_unlock_lock()``. | ||||
| 
 | ||||
| Tree RCU Grace Period Memory Ordering Components | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| 
 | ||||
| Tree RCU's grace-period memory-ordering guarantee is provided by a | ||||
| number of RCU components: | ||||
| 
 | ||||
| #. `Callback Registry`_ | ||||
| #. `Grace-Period Initialization`_ | ||||
| #. `Self-Reported Quiescent States`_ | ||||
| #. `Dynamic Tick Interface`_ | ||||
| #. `CPU-Hotplug Interface`_ | ||||
| #. `Forcing Quiescent States`_ | ||||
| #. `Grace-Period Cleanup`_ | ||||
| #. `Callback Invocation`_ | ||||
| 
 | ||||
| Each of the following sections looks at the corresponding component in | ||||
| detail. | ||||
| 
 | ||||
| Callback Registry | ||||
| ^^^^^^^^^^^^^^^^^ | ||||
| 
 | ||||
| If RCU's grace-period guarantee is to mean anything at all, any access | ||||
| that happens before a given invocation of ``call_rcu()`` must also | ||||
| happen before the corresponding grace period. The implementation of this | ||||
| portion of RCU's grace period guarantee is shown in the following | ||||
| figure: | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-callback-registry.svg | ||||
| 
 | ||||
| Because ``call_rcu()`` normally acts only on CPU-local state, it | ||||
| provides no ordering guarantees, either for itself or for phase one of | ||||
| the update (which again will usually be removal of an element from an | ||||
| RCU-protected data structure). It simply enqueues the ``rcu_head`` | ||||
| structure on a per-CPU list, which cannot become associated with a grace | ||||
| period until a later call to ``rcu_accelerate_cbs()``, as shown in the | ||||
| diagram above. | ||||
| 
 | ||||
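| For reference, a typical updater follows the two-phase pattern alluded | ||||
| to above. Here is a sketch using a hypothetical ``struct foo``: | ||||
| 
| :: | ||||
| 
|     1 struct foo { | ||||
|     2   struct list_head list; | ||||
|     3   struct rcu_head rh; | ||||
|     4 }; | ||||
|     5 | ||||
|     6 static void foo_reclaim(struct rcu_head *rhp) | ||||
|     7 { | ||||
|     8   kfree(container_of(rhp, struct foo, rh)); /* Phase two: free. */ | ||||
|     9 } | ||||
|    10 | ||||
|    11 static void foo_remove(struct foo *fp) | ||||
|    12 { | ||||
|    13   list_del_rcu(&fp->list);        /* Phase one: unlink. */ | ||||
|    14   call_rcu(&fp->rh, foo_reclaim); /* Defer phase two past a grace period. */ | ||||
|    15 } | ||||
| 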
| One set of code paths shown on the left invokes ``rcu_accelerate_cbs()`` | ||||
| via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the | ||||
| current CPU is inundated with queued ``rcu_head`` structures) or more | ||||
| likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle | ||||
| is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which | ||||
| invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The | ||||
| final code path on the right is taken only in kernels built with | ||||
| ``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via | ||||
| ``rcu_advance_cbs()``, ``rcu_migrate_callbacks()``, | ||||
| ``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn | ||||
| is invoked on a surviving CPU after the outgoing CPU has been completely | ||||
| offlined. | ||||
| 
 | ||||
| There are a few other code paths within grace-period processing that | ||||
| opportunistically invoke ``rcu_accelerate_cbs()``. However, either way, | ||||
| all of the CPU's recently queued ``rcu_head`` structures are associated | ||||
| with a future grace-period number under the protection of the CPU's leaf | ||||
| ``rcu_node`` structure's ``->lock``. In all cases, there is full | ||||
| ordering against any prior critical section for that same ``rcu_node`` | ||||
| structure's ``->lock``, and also full ordering against any of the | ||||
| current task's or CPU's prior critical sections for any ``rcu_node`` | ||||
| structure's ``->lock``. | ||||
| 
 | ||||
| The next section will show how this ordering ensures that any accesses | ||||
| prior to the ``call_rcu()`` (particularly including phase one of the | ||||
| update) happen before the start of the corresponding grace period. | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | But what about ``synchronize_rcu()``?                                 | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, | | ||||
| | which invokes it. So either way, it eventually comes down to          | | ||||
| | ``call_rcu()``.                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| Grace-Period Initialization | ||||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||||
| 
 | ||||
| Grace-period initialization is carried out by the grace-period kernel | ||||
| thread, which makes several passes over the ``rcu_node`` tree within the | ||||
| ``rcu_gp_init()`` function. This means that showing the full flow of | ||||
| ordering through the grace-period computation will require duplicating | ||||
| this tree. If you find this confusing, please note that the state of the | ||||
| ``rcu_node`` changes over time, just like Heraclitus's river. However, | ||||
| to keep the ``rcu_node`` river tractable, the grace-period kernel | ||||
| thread's traversals are presented in multiple parts, starting in this | ||||
| section with the various phases of grace-period initialization. | ||||
| 
 | ||||
| The first ordering-related grace-period initialization action is to | ||||
| advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number | ||||
| counter, as shown below: | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-gp-init-1.svg | ||||
| 
 | ||||
| The actual increment is carried out using ``smp_store_release()``, which | ||||
| helps reject false-positive RCU CPU stall detection. Note that only the | ||||
| root ``rcu_node`` structure is touched. | ||||
| 
 | ||||
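| Conceptually, this advance amounts to a release store of the next | ||||
| grace-period number, along the lines of the following sketch (which | ||||
| ignores the state bits that the real code keeps in the low-order bits | ||||
| of ``->gp_seq``): | ||||
| 
| :: | ||||
| 
|     1 static void sketch_gp_seq_start(void) | ||||
|     2 { | ||||
|     3   /* Conceptual sketch only; see rcu_seq_start() for the real thing. */ | ||||
|     4   smp_store_release(&rcu_state.gp_seq, rcu_state.gp_seq + 1); | ||||
|     5 } | ||||
| 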
| The first pass through the ``rcu_node`` tree updates bitmasks based on | ||||
| CPUs having come online or gone offline since the start of the previous | ||||
| grace period. In the common case where the number of online CPUs for | ||||
| this ``rcu_node`` structure has not transitioned to or from zero, this | ||||
| pass will scan only the leaf ``rcu_node`` structures. However, if the | ||||
| number of online CPUs for a given leaf ``rcu_node`` structure has | ||||
| transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the | ||||
| first incoming CPU. Similarly, if the number of online CPUs for a given | ||||
| leaf ``rcu_node`` structure has transitioned to zero, | ||||
| ``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU. | ||||
| The diagram below shows the path of ordering if the leftmost | ||||
| ``rcu_node`` structure onlines its first CPU and if the next | ||||
| ``rcu_node`` structure has no online CPUs (or, alternatively if the | ||||
| leftmost ``rcu_node`` structure offlines its last CPU and if the next | ||||
| ``rcu_node`` structure has no online CPUs). | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-gp-init-2.svg | ||||
| 
 | ||||
| The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses | ||||
| breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field | ||||
| to the newly advanced value from the ``rcu_state`` structure, as shown | ||||
| in the following diagram. | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-gp-init-3.svg | ||||
| 
 | ||||
| This change will also cause each CPU's next call to | ||||
| ``__note_gp_changes()`` to notice that a new grace period has started, | ||||
| as described in the next section. But because the grace-period kthread | ||||
| started the grace period at the root (with the advancing of the | ||||
| ``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf | ||||
| ``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of | ||||
| the start of the grace period will happen after the actual start of the | ||||
| grace period. | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | But what about the CPU that started the grace period? Why wouldn't it | | ||||
| | see the start of the grace period right when it started that grace    | | ||||
| | period?                                                               | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| In some deep philosophical and overly anthropomorphic sense, yes, the | | ||||
| | CPU starting the grace period is immediately aware of having done so. | | ||||
| | However, if we instead assume that RCU is not self-aware, then even   | | ||||
| | the CPU starting the grace period does not really become aware of the | | ||||
| | start of this grace period until its first call to                    | | ||||
| | ``__note_gp_changes()``. On the other hand, this CPU potentially gets | | ||||
| | early notification because it invokes ``__note_gp_changes()`` during  | | ||||
| | its last ``rcu_gp_init()`` pass through its leaf ``rcu_node``         | | ||||
| | structure.                                                            | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| Self-Reported Quiescent States | ||||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||||
| 
 | ||||
| When all entities that might block the grace period have reported | ||||
| quiescent states (or, as described in a later section, had quiescent | ||||
| states reported on their behalf), the grace period can end. Online | ||||
| non-idle CPUs report their own quiescent states, as shown in the | ||||
| following diagram: | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-qs.svg | ||||
| 
 | ||||
| This is for the last CPU to report a quiescent state, which signals the | ||||
| end of the grace period. Earlier quiescent states would push up the | ||||
| ``rcu_node`` tree only until they encountered an ``rcu_node`` structure | ||||
| that is waiting for additional quiescent states. However, ordering is | ||||
| nevertheless preserved because some later quiescent state will acquire | ||||
| that ``rcu_node`` structure's ``->lock``. | ||||
| 
 | ||||
| Any number of events can lead up to a CPU invoking ``note_gp_changes()`` | ||||
| (or alternatively, directly invoking ``__note_gp_changes()``), at which | ||||
| point that CPU will notice the start of a new grace period while holding | ||||
| its leaf ``rcu_node`` lock. Therefore, all execution shown in this | ||||
| diagram happens after the start of the grace period. In addition, this | ||||
| CPU will consider any RCU read-side critical section that started before | ||||
| the invocation of ``__note_gp_changes()`` to have started before the | ||||
| grace period, and thus a critical section that the grace period must | ||||
| wait on. | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| But an RCU read-side critical section might have started after the    | | ||||
| | beginning of the grace period (the advancing of ``->gp_seq`` from     | | ||||
| | earlier), so why should the grace period wait on such a critical      | | ||||
| | section?                                                              | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | It is indeed not necessary for the grace period to wait on such a     | | ||||
| | critical section. However, it is permissible to wait on it. And it is | | ||||
| | furthermore important to wait on it, as this lazy approach is far     | | ||||
| | more scalable than a “big bang” all-at-once grace-period start could  | | ||||
| | possibly be.                                                          | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| If the CPU does a context switch, a quiescent state will be noted by | ||||
| ``rcu_note_context_switch()`` on the left. On the other hand, if the CPU | ||||
| takes a scheduler-clock interrupt while executing in usermode, a | ||||
| quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right. | ||||
| Either way, the passage through a quiescent state will be noted in a | ||||
| per-CPU variable. | ||||
| 
 | ||||
| The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for | ||||
| example, after the next scheduler-clock interrupt), ``rcu_core()`` will | ||||
| invoke ``rcu_check_quiescent_state()``, which will notice the recorded | ||||
| quiescent state, and invoke ``rcu_report_qs_rdp()``. If | ||||
| ``rcu_report_qs_rdp()`` verifies that the quiescent state really does | ||||
| apply to the current grace period, it invokes ``rcu_report_qs_rnp()`` which | ||||
| traverses up the ``rcu_node`` tree as shown at the bottom of the | ||||
| diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask`` | ||||
| field, and propagating up the tree when the result is zero. | ||||
| 
 | ||||
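| This upward propagation can be sketched as follows (a hypothetical | ||||
| helper; the real ``rcu_report_qs_rnp()`` also checks grace-period | ||||
| numbers and reports grace-period completion once the root's mask | ||||
| clears): | ||||
| 
| :: | ||||
| 
|     1 static void sketch_report_qs_rnp(struct rcu_node *rnp, unsigned long mask) | ||||
|     2 { | ||||
|     3   for (;;) { | ||||
|     4     raw_spin_lock_rcu_node(rnp); | ||||
|     5     rnp->qsmask &= ~mask;          /* Record the quiescent state(s). */ | ||||
|     6     if (rnp->qsmask) {             /* Others still pending at this level. */ | ||||
|     7       raw_spin_unlock_rcu_node(rnp); | ||||
|     8       return; | ||||
|     9     } | ||||
|    10     mask = rnp->grpmask;           /* This subtree is now done. */ | ||||
|    11     raw_spin_unlock_rcu_node(rnp); | ||||
|    12     if (!rnp->parent) | ||||
|    13       return;                      /* Root reached: grace period may end. */ | ||||
|    14     rnp = rnp->parent;             /* Propagate one level up. */ | ||||
|    15   } | ||||
|    16 } | ||||
| 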
| Note that traversal passes upwards out of a given ``rcu_node`` structure | ||||
| only if the current CPU is reporting the last quiescent state for the | ||||
| subtree headed by that ``rcu_node`` structure. A key point is that if a | ||||
| CPU's traversal stops at a given ``rcu_node`` structure, then there will | ||||
| be a later traversal by another CPU (or perhaps the same one) that | ||||
| proceeds upwards from that point, and the ``rcu_node`` ``->lock`` | ||||
| guarantees that the first CPU's quiescent state happens before the | ||||
| remainder of the second CPU's traversal. Applying this line of thought | ||||
| repeatedly shows that all CPUs' quiescent states happen before the last | ||||
| CPU traverses through the root ``rcu_node`` structure, the “last CPU” | ||||
| being the one that clears the last bit in the root ``rcu_node`` | ||||
| structure's ``->qsmask`` field. | ||||
| 
 | ||||
| Dynamic Tick Interface | ||||
| ^^^^^^^^^^^^^^^^^^^^^^ | ||||
| 
 | ||||
| Due to energy-efficiency considerations, RCU is forbidden from | ||||
| disturbing idle CPUs. CPUs are therefore required to notify RCU when | ||||
| entering or leaving idle state, which they do via fully ordered | ||||
| value-returning atomic operations on a per-CPU variable. The ordering | ||||
| effects are as shown below: | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-dyntick.svg | ||||
| 
 | ||||
| The RCU grace-period kernel thread samples the per-CPU idleness variable | ||||
| while holding the corresponding CPU's leaf ``rcu_node`` structure's | ||||
| ``->lock``. This means that any RCU read-side critical sections that | ||||
| precede the idle period (the oval near the top of the diagram above) | ||||
| will happen before the end of the current grace period. Similarly, the | ||||
| beginning of the current grace period will happen before any RCU | ||||
| read-side critical sections that follow the idle period (the oval near | ||||
| the bottom of the diagram above). | ||||
| 
 | ||||
| Plumbing this into the full grace-period execution is described | ||||
| `below <#Forcing%20Quiescent%20States>`__. | ||||
| 
 | ||||
| CPU-Hotplug Interface | ||||
| ^^^^^^^^^^^^^^^^^^^^^ | ||||
| 
 | ||||
| RCU is also forbidden from disturbing offline CPUs, which might well be | ||||
| powered off and removed from the system completely. CPUs are therefore | ||||
| required to notify RCU of their comings and goings as part of the | ||||
| corresponding CPU hotplug operations. The ordering effects are shown | ||||
| below: | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-hotplug.svg | ||||
| 
 | ||||
| Because CPU hotplug operations are much less frequent than idle | ||||
| transitions, they are heavier weight, and thus acquire the CPU's leaf | ||||
| ``rcu_node`` structure's ``->lock`` and update this structure's | ||||
| ``->qsmaskinitnext``. The RCU grace-period kernel thread samples this | ||||
| mask to detect CPUs having gone offline since the beginning of this | ||||
| grace period. | ||||
| 
 | ||||
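| A sketch of the incoming-CPU side of this handshake is shown below; | ||||
| the outgoing path clears the same bit under the same lock (this is a | ||||
| hypothetical helper, with the real updates performed by the CPU-hotplug | ||||
| notification code): | ||||
| 
| :: | ||||
| 
|     1 static void sketch_cpu_online(struct rcu_data *rdp) | ||||
|     2 { | ||||
|     3   unsigned long flags; | ||||
|     4   struct rcu_node *rnp = rdp->mynode; | ||||
|     5 | ||||
|     6   raw_spin_lock_irqsave_rcu_node(rnp, flags); | ||||
|     7   rnp->qsmaskinitnext |= rdp->grpmask; /* Mark this CPU as online. */ | ||||
|     8   raw_spin_unlock_irqrestore_rcu_node(rnp, flags); | ||||
|     9 } | ||||
| 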
| Plumbing this into the full grace-period execution is described | ||||
| `below <#Forcing%20Quiescent%20States>`__. | ||||
| 
 | ||||
| Forcing Quiescent States | ||||
| ^^^^^^^^^^^^^^^^^^^^^^^^ | ||||
| 
 | ||||
| As noted above, idle and offline CPUs cannot report their own quiescent | ||||
| states, and therefore the grace-period kernel thread must do the | ||||
| reporting on their behalf. This process is called “forcing quiescent | ||||
| states”; it is repeated every few jiffies, and its ordering effects are | ||||
| shown below: | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-gp-fqs.svg | ||||
| 
 | ||||
| Each pass of quiescent state forcing is guaranteed to traverse the leaf | ||||
| ``rcu_node`` structures, and if there are no new quiescent states due to | ||||
| recently idled and/or offlined CPUs, then only the leaves are traversed. | ||||
| However, if there is a newly offlined CPU as illustrated on the left or | ||||
| a newly idled CPU as illustrated on the right, the corresponding | ||||
| quiescent state will be driven up towards the root. As with | ||||
| self-reported quiescent states, the upwards driving stops once it | ||||
| reaches an ``rcu_node`` structure that has quiescent states outstanding | ||||
| from other CPUs. | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | The leftmost drive to root stopped before it reached the root         | | ||||
| | ``rcu_node`` structure, which means that there are still CPUs         | | ||||
| | subordinate to that structure on which the current grace period is    | | ||||
| | waiting. Given that, how is it possible that the rightmost drive to   | | ||||
| | root ended the grace period?                                          | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | Good analysis! It is in fact impossible in the absence of bugs in     | | ||||
| | RCU. But this diagram is complex enough as it is, so simplicity       | | ||||
| | overrode accuracy. You can think of it as poetic license, or you can  | | ||||
| | think of it as misdirection that is resolved in the                   | | ||||
| | `stitched-together diagram <#Putting%20It%20All%20Together>`__.       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| Grace-Period Cleanup | ||||
| ^^^^^^^^^^^^^^^^^^^^ | ||||
| 
 | ||||
| Grace-period cleanup first scans the ``rcu_node`` tree breadth-first | ||||
| advancing all the ``->gp_seq`` fields, then it advances the | ||||
| ``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are | ||||
| shown below: | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-gp-cleanup.svg | ||||
| 
 | ||||
| As indicated by the oval at the bottom of the diagram, once grace-period | ||||
| cleanup is complete, the next grace period can begin. | ||||
| 
 | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Quick Quiz**:                                                       | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | But when precisely does the grace period end?                         | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | **Answer**:                                                           | | ||||
| +-----------------------------------------------------------------------+ | ||||
| | There is no useful single point at which the grace period can be said | | ||||
| | to end. The earliest reasonable candidate is as soon as the last CPU  | | ||||
| | has reported its quiescent state, but it may be some milliseconds     | | ||||
| | before RCU becomes aware of this. The latest reasonable candidate is  | | ||||
| | once the ``rcu_state`` structure's ``->gp_seq`` field has been        | | ||||
| | updated, but it is quite possible that some CPUs have already         | | ||||
| | completed phase two of their updates by that time. In short, if you   | | ||||
| | are going to work with RCU, you need to learn to embrace uncertainty. | | ||||
| +-----------------------------------------------------------------------+ | ||||
| 
 | ||||
| Callback Invocation | ||||
| ^^^^^^^^^^^^^^^^^^^ | ||||
| 
 | ||||
| Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has | ||||
| been updated, that CPU can begin invoking its RCU callbacks that were | ||||
| waiting for this grace period to end. These callbacks are identified by | ||||
| ``rcu_advance_cbs()``, which is usually invoked by | ||||
| ``__note_gp_changes()``. As shown in the diagram below, this invocation | ||||
| can be triggered by the scheduling-clock interrupt | ||||
| (``rcu_sched_clock_irq()`` on the left) or by idle entry | ||||
| (``rcu_cleanup_after_idle()`` on the right, but only for kernels built | ||||
| with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is | ||||
| raised, which results in ``rcu_do_batch()`` invoking the callbacks, | ||||
| which in turn allows those callbacks to carry out (either directly or | ||||
| indirectly via wakeup) the needed phase-two processing for each update. | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-callback-invocation.svg | ||||
| 
 | ||||
| Please note that callback invocation can also be prompted by any number | ||||
| of corner-case code paths, for example, when a CPU notes that it has | ||||
| excessive numbers of callbacks queued. In all cases, the CPU acquires | ||||
| its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks, | ||||
| which preserves the required ordering against the newly completed grace | ||||
| period. | ||||
| 
 | ||||
| However, if the callback function communicates to other CPUs, for | ||||
| example, doing a wakeup, then it is that function's responsibility to | ||||
| maintain ordering. For example, if the callback function wakes up a task | ||||
| that runs on some other CPU, proper ordering must be in place in both the | ||||
| callback function and the task being awakened. To see why this is | ||||
| important, consider the top half of the `grace-period | ||||
| cleanup <#Grace-Period%20Cleanup>`__ diagram. The callback might be | ||||
| running on a CPU corresponding to the leftmost leaf ``rcu_node`` | ||||
| structure, and awaken a task that is to run on a CPU corresponding to | ||||
| the rightmost leaf ``rcu_node`` structure, and the grace-period kernel | ||||
| thread might not yet have reached the rightmost leaf. In this case, the | ||||
| grace period's memory ordering might not yet have reached that CPU, so | ||||
| again the callback function and the awakened task must supply proper | ||||
| ordering. | ||||
| 
 | ||||
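| One way to supply that ordering, sketched here with a hypothetical | ||||
| flag and wait queue shared between the callback and the awakened task, | ||||
| is an explicit release/acquire pair: | ||||
| 
| :: | ||||
| 
|     1 static int sketch_gp_done;                 /* Hypothetical shared flag. */ | ||||
|     2 static DECLARE_WAIT_QUEUE_HEAD(sketch_wq); /* Hypothetical wait queue. */ | ||||
|     3 | ||||
|     4 static void sketch_callback(struct rcu_head *rhp) | ||||
|     5 { | ||||
|     6   smp_store_release(&sketch_gp_done, 1); /* Publish post-GP state. */ | ||||
|     7   wake_up(&sketch_wq); | ||||
|     8 } | ||||
|     9 | ||||
|    10 static void sketch_waiter(void) | ||||
|    11 { | ||||
|    12   wait_event(sketch_wq, smp_load_acquire(&sketch_gp_done)); | ||||
|    13   /* This task's later accesses are now ordered after the grace period. */ | ||||
|    14 } | ||||
| 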
| Putting It All Together | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| 
 | ||||
| A stitched-together diagram is here: | ||||
| 
 | ||||
| .. kernel-figure:: TreeRCU-gp.svg | ||||
| 
 | ||||
| Legal Statement | ||||
| ~~~~~~~~~~~~~~~ | ||||
| 
 | ||||
| This work represents the view of the author and does not necessarily | ||||
| represent the view of IBM. | ||||
| 
 | ||||
| Linux is a registered trademark of Linus Torvalds. | ||||
| 
 | ||||
| Other company, product, and service names may be trademarks or service | ||||
| marks of others. | ||||
|  | @ -3880,7 +3880,7 @@ | |||
|          font-style="normal" | ||||
|          y="-4418.6582" | ||||
|          x="3745.7725" | ||||
|          xml:space="preserve">rcu_node_context_switch()</text> | ||||
|          xml:space="preserve">rcu_note_context_switch()</text> | ||||
|     </g> | ||||
|     <g | ||||
|        transform="translate(1881.1886,54048.57)" | ||||
|  |  | |||
| Before Width: | Height: | Size: 209 KiB After Width: | Height: | Size: 209 KiB | 
|  | @ -753,7 +753,7 @@ | |||
|          font-style="normal" | ||||
|          y="-4418.6582" | ||||
|          x="3745.7725" | ||||
|          xml:space="preserve">rcu_node_context_switch()</text> | ||||
|          xml:space="preserve">rcu_note_context_switch()</text> | ||||
|     </g> | ||||
|     <g | ||||
|        transform="translate(3131.2648,-585.6713)" | ||||
|  |  | |||
| Before Width: | Height: | Size: 43 KiB After Width: | Height: | Size: 43 KiB | 
										
											
												File diff suppressed because it is too large
												Load diff
											
										
									
								
							
							
								
								
									
										2704
									
								
								Documentation/RCU/Design/Requirements/Requirements.rst
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
										2704
									
								
								Documentation/RCU/Design/Requirements/Requirements.rst
									
										
									
									
									
										Normal file
									
								
							
										
											
												File diff suppressed because it is too large
												Load diff
											
										
									
								
							|  | @ -5,12 +5,17 @@ RCU concepts | |||
| ============ | ||||
| 
 | ||||
| .. toctree:: | ||||
|    :maxdepth: 1 | ||||
|    :maxdepth: 3 | ||||
| 
 | ||||
|    rcu | ||||
|    listRCU | ||||
|    UP | ||||
| 
 | ||||
|    Design/Memory-Ordering/Tree-RCU-Memory-Ordering | ||||
|    Design/Expedited-Grace-Periods/Expedited-Grace-Periods | ||||
|    Design/Requirements/Requirements | ||||
|    Design/Data-Structures/Data-Structures | ||||
| 
 | ||||
| .. only:: subproject and html | ||||
| 
 | ||||
|    Indices | ||||
|  |  | |||
|  | @ -96,7 +96,17 @@ other flavors of rcu_dereference().  On the other hand, it is illegal | |||
| to use rcu_dereference_protected() if either the RCU-protected pointer | ||||
| or the RCU-protected data that it points to can change concurrently. | ||||
| 
 | ||||
| There are currently only "universal" versions of the rcu_assign_pointer() | ||||
| and RCU list-/tree-traversal primitives, which do not (yet) check for | ||||
| being in an RCU read-side critical section.  In the future, separate | ||||
| versions of these primitives might be created. | ||||
| Like rcu_dereference(), when lockdep is enabled, RCU list and hlist | ||||
| traversal primitives check for being called from within an RCU read-side | ||||
| critical section.  However, a lockdep expression can be passed to them | ||||
| as an additional optional argument.  With this lockdep expression, these | ||||
| traversal primitives will complain only if the lockdep expression is | ||||
| false and they are called from outside any RCU read-side critical section. | ||||
| 
 | ||||
| For example, the workqueue for_each_pwq() macro is intended to be used | ||||
| either within an RCU read-side critical section or with wq->mutex held. | ||||
| It is thus implemented as follows: | ||||
| 
 | ||||
| 	#define for_each_pwq(pwq, wq) | ||||
| 		list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node, | ||||
| 					lock_is_held(&(wq->mutex).dep_map)) | ||||
|  |  | |||
|  | @ -290,7 +290,7 @@ rcu_dereference() | |||
| 	at any time, including immediately after the rcu_dereference(). | ||||
| 	And, again like rcu_assign_pointer(), rcu_dereference() is | ||||
| 	typically used indirectly, via the _rcu list-manipulation | ||||
| 	primitives, such as list_for_each_entry_rcu(). | ||||
| 	primitives, such as list_for_each_entry_rcu() [2]. | ||||
| 
 | ||||
| 	[1] The variant rcu_dereference_protected() can be used outside | ||||
| 	of an RCU read-side critical section as long as the usage is | ||||
|  | @ -302,9 +302,17 @@ rcu_dereference() | |||
| 	must prohibit.	The rcu_dereference_protected() variant takes | ||||
| 	a lockdep expression to indicate which locks must be acquired | ||||
| 	by the caller. If the indicated protection is not provided, | ||||
| 	a lockdep splat is emitted.  See RCU/Design/Requirements/Requirements.html | ||||
| 	a lockdep splat is emitted.  See Documentation/RCU/Design/Requirements/Requirements.rst | ||||
| 	and the API's code comments for more details and example usage. | ||||
| 
 | ||||
| 	[2] If the list_for_each_entry_rcu() instance might be used by | ||||
| 	update-side code as well as by RCU readers, then an additional | ||||
| 	lockdep expression can be added to its list of arguments. | ||||
| 	For example, given an additional "lock_is_held(&mylock)" argument, | ||||
| 	the RCU lockdep code would complain only if this instance was | ||||
| 	invoked outside of an RCU read-side critical section and without | ||||
| 	the protection of mylock. | ||||
| 
 | ||||
| The following diagram shows how each API communicates among the | ||||
| reader, updater, and reclaimer. | ||||
| 
 | ||||
|  | @ -630,7 +638,7 @@ been able to write-acquire the lock otherwise.  The smp_mb__after_spinlock() | |||
| promotes synchronize_rcu() to a full memory barrier in compliance with | ||||
| the "Memory-Barrier Guarantees" listed in: | ||||
| 
 | ||||
| 	Documentation/RCU/Design/Requirements/Requirements.html. | ||||
| 	Documentation/RCU/Design/Requirements/Requirements.rst | ||||
| 
 | ||||
| It is possible to nest rcu_read_lock(), since reader-writer locks may | ||||
| be recursively acquired.  Note also that rcu_read_lock() is immune | ||||
|  |  | |||
|  | @ -508,8 +508,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp) | |||
| 	*filter = tmp; | ||||
| 
 | ||||
| 	mutex_lock(&kvm->lock); | ||||
| 	rcu_swap_protected(kvm->arch.pmu_event_filter, filter, | ||||
| 			   mutex_is_locked(&kvm->lock)); | ||||
| 	filter = rcu_replace_pointer(kvm->arch.pmu_event_filter, filter, | ||||
| 				     mutex_is_locked(&kvm->lock)); | ||||
| 	mutex_unlock(&kvm->lock); | ||||
| 
 | ||||
| 	synchronize_srcu_expedited(&kvm->srcu); | ||||
|  |  | |||
|  | @ -1634,7 +1634,7 @@ replace: | |||
| 		i915_gem_context_set_user_engines(ctx); | ||||
| 	else | ||||
| 		i915_gem_context_clear_user_engines(ctx); | ||||
| 	rcu_swap_protected(ctx->engines, set.engines, 1); | ||||
| 	set.engines = rcu_replace_pointer(ctx->engines, set.engines, 1); | ||||
| 	mutex_unlock(&ctx->engines_mutex); | ||||
| 
 | ||||
| 	call_rcu(&set.engines->rcu, free_engines_rcu); | ||||
|  |  | |||
|  | @ -434,8 +434,8 @@ static void scsi_update_vpd_page(struct scsi_device *sdev, u8 page, | |||
| 		return; | ||||
| 
 | ||||
| 	mutex_lock(&sdev->inquiry_mutex); | ||||
| 	rcu_swap_protected(*sdev_vpd_buf, vpd_buf, | ||||
| 			   lockdep_is_held(&sdev->inquiry_mutex)); | ||||
| 	vpd_buf = rcu_replace_pointer(*sdev_vpd_buf, vpd_buf, | ||||
| 				      lockdep_is_held(&sdev->inquiry_mutex)); | ||||
| 	mutex_unlock(&sdev->inquiry_mutex); | ||||
| 
 | ||||
| 	if (vpd_buf) | ||||
|  |  | |||
|  | @ -466,10 +466,10 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work) | |||
| 	sdev->request_queue = NULL; | ||||
| 
 | ||||
| 	mutex_lock(&sdev->inquiry_mutex); | ||||
| 	rcu_swap_protected(sdev->vpd_pg80, vpd_pg80, | ||||
| 			   lockdep_is_held(&sdev->inquiry_mutex)); | ||||
| 	rcu_swap_protected(sdev->vpd_pg83, vpd_pg83, | ||||
| 			   lockdep_is_held(&sdev->inquiry_mutex)); | ||||
| 	vpd_pg80 = rcu_replace_pointer(sdev->vpd_pg80, vpd_pg80, | ||||
| 				       lockdep_is_held(&sdev->inquiry_mutex)); | ||||
| 	vpd_pg83 = rcu_replace_pointer(sdev->vpd_pg83, vpd_pg83, | ||||
| 				       lockdep_is_held(&sdev->inquiry_mutex)); | ||||
| 	mutex_unlock(&sdev->inquiry_mutex); | ||||
| 
 | ||||
| 	if (vpd_pg83) | ||||
|  |  | |||
|  | @ -279,8 +279,8 @@ struct afs_vlserver_list *afs_extract_vlserver_list(struct afs_cell *cell, | |||
| 			struct afs_addr_list *old = addrs; | ||||
| 
 | ||||
| 			write_lock(&server->lock); | ||||
| 			rcu_swap_protected(server->addresses, old, | ||||
| 					   lockdep_is_held(&server->lock)); | ||||
| 			old = rcu_replace_pointer(server->addresses, old, | ||||
| 						  lockdep_is_held(&server->lock)); | ||||
| 			write_unlock(&server->lock); | ||||
| 			afs_put_addrlist(old); | ||||
| 		} | ||||
|  |  | |||
|  | @ -24,34 +24,6 @@ static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h) | |||
| 		((unsigned long)rcu_dereference_check(h->first, hlist_bl_is_locked(h)) & ~LIST_BL_LOCKMASK); | ||||
| } | ||||
| 
 | ||||
| /**
 | ||||
|  * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization | ||||
|  * @n: the element to delete from the hash list. | ||||
|  * | ||||
|  * Note: hlist_bl_unhashed() on the node returns true after this. It is | ||||
|  * useful for RCU based read lockfree traversal if the writer side | ||||
|  * must know if the list entry is still hashed or already unhashed. | ||||
|  * | ||||
|  * In particular, it means that we can not poison the forward pointers | ||||
|  * that may still be used for walking the hash list and we can only | ||||
|  * zero the pprev pointer so list_unhashed() will return true after | ||||
|  * this. | ||||
|  * | ||||
|  * The caller must take whatever precautions are necessary (such as | ||||
|  * holding appropriate locks) to avoid racing with another | ||||
|  * list-mutation primitive, such as hlist_bl_add_head_rcu() or | ||||
|  * hlist_bl_del_rcu(), running on this same list.  However, it is | ||||
|  * perfectly legal to run concurrently with the _rcu list-traversal | ||||
|  * primitives, such as hlist_bl_for_each_entry_rcu(). | ||||
|  */ | ||||
| static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n) | ||||
| { | ||||
| 	if (!hlist_bl_unhashed(n)) { | ||||
| 		__hlist_bl_del(n); | ||||
| 		n->pprev = NULL; | ||||
| 	} | ||||
| } | ||||
| 
 | ||||
| /**
 | ||||
|  * hlist_bl_del_rcu - deletes entry from hash list without re-initialization | ||||
|  * @n: the element to delete from the hash list. | ||||
|  |  | |||
|  | @ -382,6 +382,24 @@ do {									      \ | |||
| 		smp_store_release(&p, RCU_INITIALIZER((typeof(p))_r_a_p__v)); \ | ||||
| } while (0) | ||||
| 
 | ||||
| /**
 | ||||
|  * rcu_replace_pointer() - replace an RCU pointer, returning its old value | ||||
|  * @rcu_ptr: RCU pointer, whose old value is returned | ||||
|  * @ptr: regular pointer | ||||
|  * @c: the lockdep conditions under which the dereference will take place | ||||
|  * | ||||
|  * Perform a replacement, where @rcu_ptr is an RCU-annotated | ||||
|  * pointer and @c is the lockdep argument that is passed to the | ||||
|  * rcu_dereference_protected() call used to read that pointer.  The old | ||||
|  * value of @rcu_ptr is returned, and @rcu_ptr is set to @ptr. | ||||
|  */ | ||||
| #define rcu_replace_pointer(rcu_ptr, ptr, c)				\ | ||||
| ({									\ | ||||
| 	typeof(ptr) __tmp = rcu_dereference_protected((rcu_ptr), (c));	\ | ||||
| 	rcu_assign_pointer((rcu_ptr), (ptr));				\ | ||||
| 	__tmp;								\ | ||||
| }) | ||||
| 
 | ||||
| /**
 | ||||
|  * rcu_swap_protected() - swap an RCU and a regular pointer | ||||
|  * @rcu_ptr: RCU pointer | ||||
|  |  | |||
|  | @ -84,6 +84,7 @@ static inline void rcu_scheduler_starting(void) { } | |||
| #endif /* #else #ifndef CONFIG_SRCU */ | ||||
| static inline void rcu_end_inkernel_boot(void) { } | ||||
| static inline bool rcu_is_watching(void) { return true; } | ||||
| static inline void rcu_momentary_dyntick_idle(void) { } | ||||
| 
 | ||||
| /* Avoid RCU read-side critical sections leaking across. */ | ||||
| static inline void rcu_all_qs(void) { barrier(); } | ||||
|  |  | |||
|  | @ -37,6 +37,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func); | |||
| 
 | ||||
| void rcu_barrier(void); | ||||
| bool rcu_eqs_special_set(int cpu); | ||||
| void rcu_momentary_dyntick_idle(void); | ||||
| unsigned long get_state_synchronize_rcu(void); | ||||
| void cond_synchronize_rcu(unsigned long oldstate); | ||||
| 
 | ||||
|  |  | |||
|  | @ -108,7 +108,8 @@ enum tick_dep_bits { | |||
| 	TICK_DEP_BIT_POSIX_TIMER	= 0, | ||||
| 	TICK_DEP_BIT_PERF_EVENTS	= 1, | ||||
| 	TICK_DEP_BIT_SCHED		= 2, | ||||
| 	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3 | ||||
| 	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3, | ||||
| 	TICK_DEP_BIT_RCU		= 4 | ||||
| }; | ||||
| 
 | ||||
| #define TICK_DEP_MASK_NONE		0 | ||||
|  | @ -116,6 +117,7 @@ enum tick_dep_bits { | |||
| #define TICK_DEP_MASK_PERF_EVENTS	(1 << TICK_DEP_BIT_PERF_EVENTS) | ||||
| #define TICK_DEP_MASK_SCHED		(1 << TICK_DEP_BIT_SCHED) | ||||
| #define TICK_DEP_MASK_CLOCK_UNSTABLE	(1 << TICK_DEP_BIT_CLOCK_UNSTABLE) | ||||
| #define TICK_DEP_MASK_RCU		(1 << TICK_DEP_BIT_RCU) | ||||
| 
 | ||||
| #ifdef CONFIG_NO_HZ_COMMON | ||||
| extern bool tick_nohz_enabled; | ||||
|  | @ -268,6 +270,9 @@ static inline bool tick_nohz_full_enabled(void) { return false; } | |||
| static inline bool tick_nohz_full_cpu(int cpu) { return false; } | ||||
| static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { } | ||||
| 
 | ||||
| static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } | ||||
| static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } | ||||
| 
 | ||||
| static inline void tick_dep_set(enum tick_dep_bits bit) { } | ||||
| static inline void tick_dep_clear(enum tick_dep_bits bit) { } | ||||
| static inline void tick_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } | ||||
|  |  | |||
|  | @ -93,16 +93,16 @@ TRACE_EVENT_RCU(rcu_grace_period, | |||
|  * the data from the rcu_node structure, other than rcuname, which comes | ||||
|  * from the rcu_state structure, and event, which is one of the following: | ||||
|  * | ||||
|  * "Startleaf": Request a grace period based on leaf-node data. | ||||
|  * "Cleanup": Clean up rcu_node structure after previous GP. | ||||
|  * "CleanupMore": Clean up, and another GP is needed. | ||||
|  * "EndWait": Complete wait. | ||||
|  * "NoGPkthread": The RCU grace-period kthread has not yet started. | ||||
|  * "Prestarted": Someone beat us to the request | ||||
|  * "Startedleaf": Leaf node marked for future GP. | ||||
|  * "Startedleafroot": All nodes from leaf to root marked for future GP. | ||||
|  * "Startedroot": Requested a nocb grace period based on root-node data. | ||||
|  * "NoGPkthread": The RCU grace-period kthread has not yet started. | ||||
|  * "Startleaf": Request a grace period based on leaf-node data. | ||||
|  * "StartWait": Start waiting for the requested grace period. | ||||
|  * "EndWait": Complete wait. | ||||
|  * "Cleanup": Clean up rcu_node structure after previous GP. | ||||
|  * "CleanupMore": Clean up, and another GP is needed. | ||||
|  */ | ||||
| TRACE_EVENT_RCU(rcu_future_grace_period, | ||||
| 
 | ||||
|  | @ -258,20 +258,27 @@ TRACE_EVENT_RCU(rcu_exp_funnel_lock, | |||
|  * the number of the offloaded CPU are extracted.  The third and final | ||||
|  * argument is a string as follows: | ||||
|  * | ||||
|  *	"WakeEmpty": Wake rcuo kthread, first CB to empty list. | ||||
|  *	"WakeEmptyIsDeferred": Wake rcuo kthread later, first CB to empty list. | ||||
|  *	"WakeOvf": Wake rcuo kthread, CB list is huge. | ||||
|  *	"WakeOvfIsDeferred": Wake rcuo kthread later, CB list is huge. | ||||
|  *	"WakeNot": Don't wake rcuo kthread. | ||||
|  *	"WakeNotPoll": Don't wake rcuo kthread because it is polling. | ||||
|  *	"DeferredWake": Carried out the "IsDeferred" wakeup. | ||||
|  *	"Poll": Start of new polling cycle for rcu_nocb_poll. | ||||
|  *	"Sleep": Sleep waiting for GP for !rcu_nocb_poll. | ||||
|  *	"CBSleep": Sleep waiting for CBs for !rcu_nocb_poll. | ||||
|  *	"WokeEmpty": rcuo kthread woke to find empty list. | ||||
|  *	"WokeNonEmpty": rcuo kthread woke to find non-empty list. | ||||
|  *	"WaitQueue": Enqueue partially done, timed wait for it to complete. | ||||
|  *	"WokeQueue": Partial enqueue now complete. | ||||
|  * "AlreadyAwake": The to-be-awakened rcuo kthread is already awake. | ||||
|  * "Bypass": rcuo GP kthread sees non-empty ->nocb_bypass. | ||||
|  * "CBSleep": rcuo CB kthread sleeping waiting for CBs. | ||||
|  * "Check": rcuo GP kthread checking specified CPU for work. | ||||
|  * "DeferredWake": Timer expired or polled check, time to wake. | ||||
|  * "DoWake": The to-be-awakened rcuo kthread needs to be awakened. | ||||
|  * "EndSleep": Done waiting for GP for !rcu_nocb_poll. | ||||
|  * "FirstBQ": New CB to empty ->nocb_bypass (->cblist maybe non-empty). | ||||
|  * "FirstBQnoWake": FirstBQ plus rcuo kthread need not be awakened. | ||||
|  * "FirstBQwake": FirstBQ plus rcuo kthread must be awakened. | ||||
|  * "FirstQ": New CB to empty ->cblist (->nocb_bypass maybe non-empty). | ||||
|  * "NeedWaitGP": rcuo GP kthread must wait on a grace period. | ||||
|  * "Poll": Start of new polling cycle for rcu_nocb_poll. | ||||
|  * "Sleep": Sleep waiting for GP for !rcu_nocb_poll. | ||||
|  * "Timer": Deferred-wake timer expired. | ||||
|  * "WakeEmptyIsDeferred": Wake rcuo kthread later, first CB to empty list. | ||||
|  * "WakeEmpty": Wake rcuo kthread, first CB to empty list. | ||||
|  * "WakeNot": Don't wake rcuo kthread. | ||||
|  * "WakeNotPoll": Don't wake rcuo kthread because it is polling. | ||||
|  * "WakeOvfIsDeferred": Wake rcuo kthread later, CB list is huge. | ||||
|  * "WokeEmpty": rcuo CB kthread woke to find empty list. | ||||
|  */ | ||||
| TRACE_EVENT_RCU(rcu_nocb_wake, | ||||
| 
 | ||||
|  | @ -713,8 +720,6 @@ TRACE_EVENT_RCU(rcu_torture_read, | |||
|  *	"Begin": rcu_barrier() started. | ||||
|  *	"EarlyExit": rcu_barrier() piggybacked, thus early exit. | ||||
|  *	"Inc1": rcu_barrier() piggyback check counter incremented. | ||||
|  *	"OfflineNoCB": rcu_barrier() found callback on never-online CPU | ||||
|  *	"OnlineNoCB": rcu_barrier() found online no-CBs CPU. | ||||
|  *	"OnlineQ": rcu_barrier() found online CPU with callbacks. | ||||
|  *	"OnlineNQ": rcu_barrier() found online CPU, no callbacks. | ||||
|  *	"IRQ": An rcu_barrier_callback() callback posted on remote CPU. | ||||
|  |  | |||
|  | @ -367,7 +367,8 @@ TRACE_EVENT(itimer_expire, | |||
| 		tick_dep_name(POSIX_TIMER)		\ | ||||
| 		tick_dep_name(PERF_EVENTS)		\ | ||||
| 		tick_dep_name(SCHED)			\ | ||||
| 		tick_dep_name_end(CLOCK_UNSTABLE) | ||||
| 		tick_dep_name(CLOCK_UNSTABLE)		\ | ||||
| 		tick_dep_name_end(RCU) | ||||
| 
 | ||||
| #undef tick_dep_name | ||||
| #undef tick_dep_mask_name | ||||
|  |  | |||
|  | @ -180,8 +180,8 @@ static void activate_effective_progs(struct cgroup *cgrp, | |||
| 				     enum bpf_attach_type type, | ||||
| 				     struct bpf_prog_array *old_array) | ||||
| { | ||||
| 	rcu_swap_protected(cgrp->bpf.effective[type], old_array, | ||||
| 			   lockdep_is_held(&cgroup_mutex)); | ||||
| 	old_array = rcu_replace_pointer(cgrp->bpf.effective[type], old_array, | ||||
| 					lockdep_is_held(&cgroup_mutex)); | ||||
| 	/* free prog array after grace period, since __cgroup_bpf_run_*()
 | ||||
| 	 * might be still walking the array | ||||
| 	 */ | ||||
|  |  | |||
|  | @ -16,7 +16,6 @@ | |||
| #include <linux/kthread.h> | ||||
| #include <linux/sched/rt.h> | ||||
| #include <linux/spinlock.h> | ||||
| #include <linux/rwlock.h> | ||||
| #include <linux/mutex.h> | ||||
| #include <linux/rwsem.h> | ||||
| #include <linux/smp.h> | ||||
|  | @ -889,16 +888,16 @@ static int __init lock_torture_init(void) | |||
| 		cxt.nrealwriters_stress = 2 * num_online_cpus(); | ||||
| 
 | ||||
| #ifdef CONFIG_DEBUG_MUTEXES | ||||
| 	if (strncmp(torture_type, "mutex", 5) == 0) | ||||
| 	if (str_has_prefix(torture_type, "mutex")) | ||||
| 		cxt.debug_lock = true; | ||||
| #endif | ||||
| #ifdef CONFIG_DEBUG_RT_MUTEXES | ||||
| 	if (strncmp(torture_type, "rtmutex", 7) == 0) | ||||
| 	if (str_has_prefix(torture_type, "rtmutex")) | ||||
| 		cxt.debug_lock = true; | ||||
| #endif | ||||
| #ifdef CONFIG_DEBUG_SPINLOCK | ||||
| 	if ((strncmp(torture_type, "spin", 4) == 0) || | ||||
| 	    (strncmp(torture_type, "rw_lock", 7) == 0)) | ||||
| 	if ((str_has_prefix(torture_type, "spin")) || | ||||
| 	    (str_has_prefix(torture_type, "rw_lock"))) | ||||
| 		cxt.debug_lock = true; | ||||
| #endif | ||||
| 
 | ||||
|  |  | |||
|  | @ -299,6 +299,8 @@ static inline void rcu_init_levelspread(int *levelspread, const int *levelcnt) | |||
| { | ||||
| 	int i; | ||||
| 
 | ||||
| 	for (i = 0; i < RCU_NUM_LVLS; i++) | ||||
| 		levelspread[i] = INT_MIN; | ||||
| 	if (rcu_fanout_exact) { | ||||
| 		levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf; | ||||
| 		for (i = rcu_num_lvls - 2; i >= 0; i--) | ||||
|  | @ -455,7 +457,6 @@ enum rcutorture_type { | |||
| #if defined(CONFIG_TREE_RCU) || defined(CONFIG_PREEMPT_RCU) | ||||
| void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags, | ||||
| 			    unsigned long *gp_seq); | ||||
| void rcutorture_record_progress(unsigned long vernum); | ||||
| void do_trace_rcu_torture_read(const char *rcutorturename, | ||||
| 			       struct rcu_head *rhp, | ||||
| 			       unsigned long secs, | ||||
|  | @ -468,7 +469,6 @@ static inline void rcutorture_get_gp_data(enum rcutorture_type test_type, | |||
| 	*flags = 0; | ||||
| 	*gp_seq = 0; | ||||
| } | ||||
| static inline void rcutorture_record_progress(unsigned long vernum) { } | ||||
| #ifdef CONFIG_RCU_TRACE | ||||
| void do_trace_rcu_torture_read(const char *rcutorturename, | ||||
| 			       struct rcu_head *rhp, | ||||
|  |  | |||
|  | @ -88,7 +88,7 @@ struct rcu_head *rcu_cblist_dequeue(struct rcu_cblist *rclp) | |||
| } | ||||
| 
 | ||||
| /* Set the length of an rcu_segcblist structure. */ | ||||
| void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v) | ||||
| static void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v) | ||||
| { | ||||
| #ifdef CONFIG_RCU_NOCB_CPU | ||||
| 	atomic_long_set(&rsclp->len, v); | ||||
|  | @ -104,7 +104,7 @@ void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v) | |||
|  * This increase is fully ordered with respect to the callers accesses | ||||
|  * both before and after. | ||||
|  */ | ||||
| void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v) | ||||
| static void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v) | ||||
| { | ||||
| #ifdef CONFIG_RCU_NOCB_CPU | ||||
| 	smp_mb__before_atomic(); /* Up to the caller! */ | ||||
|  | @ -134,7 +134,7 @@ void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp) | |||
|  * with the actual number of callbacks on the structure.  This exchange is | ||||
|  * fully ordered with respect to the callers accesses both before and after. | ||||
|  */ | ||||
| long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v) | ||||
| static long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v) | ||||
| { | ||||
| #ifdef CONFIG_RCU_NOCB_CPU | ||||
| 	return atomic_long_xchg(&rsclp->len, v); | ||||
|  |  | |||
|  | @ -109,15 +109,6 @@ static unsigned long b_rcu_perf_writer_started; | |||
| static unsigned long b_rcu_perf_writer_finished; | ||||
| static DEFINE_PER_CPU(atomic_t, n_async_inflight); | ||||
| 
 | ||||
| static int rcu_perf_writer_state; | ||||
| #define RTWS_INIT		0 | ||||
| #define RTWS_ASYNC		1 | ||||
| #define RTWS_BARRIER		2 | ||||
| #define RTWS_EXP_SYNC		3 | ||||
| #define RTWS_SYNC		4 | ||||
| #define RTWS_IDLE		5 | ||||
| #define RTWS_STOPPING		6 | ||||
| 
 | ||||
| #define MAX_MEAS 10000 | ||||
| #define MIN_MEAS 100 | ||||
| 
 | ||||
|  | @ -404,25 +395,20 @@ retry: | |||
| 			if (!rhp) | ||||
| 				rhp = kmalloc(sizeof(*rhp), GFP_KERNEL); | ||||
| 			if (rhp && atomic_read(this_cpu_ptr(&n_async_inflight)) < gp_async_max) { | ||||
| 				rcu_perf_writer_state = RTWS_ASYNC; | ||||
| 				atomic_inc(this_cpu_ptr(&n_async_inflight)); | ||||
| 				cur_ops->async(rhp, rcu_perf_async_cb); | ||||
| 				rhp = NULL; | ||||
| 			} else if (!kthread_should_stop()) { | ||||
| 				rcu_perf_writer_state = RTWS_BARRIER; | ||||
| 				cur_ops->gp_barrier(); | ||||
| 				goto retry; | ||||
| 			} else { | ||||
| 				kfree(rhp); /* Because we are stopping. */ | ||||
| 			} | ||||
| 		} else if (gp_exp) { | ||||
| 			rcu_perf_writer_state = RTWS_EXP_SYNC; | ||||
| 			cur_ops->exp_sync(); | ||||
| 		} else { | ||||
| 			rcu_perf_writer_state = RTWS_SYNC; | ||||
| 			cur_ops->sync(); | ||||
| 		} | ||||
| 		rcu_perf_writer_state = RTWS_IDLE; | ||||
| 		t = ktime_get_mono_fast_ns(); | ||||
| 		*wdp = t - *wdp; | ||||
| 		i_max = i; | ||||
|  | @ -463,10 +449,8 @@ retry: | |||
| 		rcu_perf_wait_shutdown(); | ||||
| 	} while (!torture_must_stop()); | ||||
| 	if (gp_async) { | ||||
| 		rcu_perf_writer_state = RTWS_BARRIER; | ||||
| 		cur_ops->gp_barrier(); | ||||
| 	} | ||||
| 	rcu_perf_writer_state = RTWS_STOPPING; | ||||
| 	writer_n_durations[me] = i_max; | ||||
| 	torture_kthread_stopping("rcu_perf_writer"); | ||||
| 	return 0; | ||||
|  |  | |||
|  | @ -44,6 +44,7 @@ | |||
| #include <linux/sched/debug.h> | ||||
| #include <linux/sched/sysctl.h> | ||||
| #include <linux/oom.h> | ||||
| #include <linux/tick.h> | ||||
| 
 | ||||
| #include "rcu.h" | ||||
| 
 | ||||
|  | @ -1363,15 +1364,15 @@ rcu_torture_reader(void *arg) | |||
| 	set_user_nice(current, MAX_NICE); | ||||
| 	if (irqreader && cur_ops->irq_capable) | ||||
| 		timer_setup_on_stack(&t, rcu_torture_timer, 0); | ||||
| 
 | ||||
| 	tick_dep_set_task(current, TICK_DEP_BIT_RCU); | ||||
| 	do { | ||||
| 		if (irqreader && cur_ops->irq_capable) { | ||||
| 			if (!timer_pending(&t)) | ||||
| 				mod_timer(&t, jiffies + 1); | ||||
| 		} | ||||
| 		if (!rcu_torture_one_read(&rand)) | ||||
| 		if (!rcu_torture_one_read(&rand) && !torture_must_stop()) | ||||
| 			schedule_timeout_interruptible(HZ); | ||||
| 		if (time_after(jiffies, lastsleep)) { | ||||
| 		if (time_after(jiffies, lastsleep) && !torture_must_stop()) { | ||||
| 			schedule_timeout_interruptible(1); | ||||
| 			lastsleep = jiffies + 10; | ||||
| 		} | ||||
|  | @ -1383,6 +1384,7 @@ rcu_torture_reader(void *arg) | |||
| 		del_timer_sync(&t); | ||||
| 		destroy_timer_on_stack(&t); | ||||
| 	} | ||||
| 	tick_dep_clear_task(current, TICK_DEP_BIT_RCU); | ||||
| 	torture_kthread_stopping("rcu_torture_reader"); | ||||
| 	return 0; | ||||
| } | ||||
|  | @ -1442,15 +1444,18 @@ rcu_torture_stats_print(void) | |||
| 		n_rcu_torture_barrier_error); | ||||
| 
 | ||||
| 	pr_alert("%s%s ", torture_type, TORTURE_FLAG); | ||||
| 	if (atomic_read(&n_rcu_torture_mberror) != 0 || | ||||
| 	    n_rcu_torture_barrier_error != 0 || | ||||
| 	    n_rcu_torture_boost_ktrerror != 0 || | ||||
| 	    n_rcu_torture_boost_rterror != 0 || | ||||
| 	    n_rcu_torture_boost_failure != 0 || | ||||
| 	if (atomic_read(&n_rcu_torture_mberror) || | ||||
| 	    n_rcu_torture_barrier_error || n_rcu_torture_boost_ktrerror || | ||||
| 	    n_rcu_torture_boost_rterror || n_rcu_torture_boost_failure || | ||||
| 	    i > 1) { | ||||
| 		pr_cont("%s", "!!! "); | ||||
| 		atomic_inc(&n_rcu_torture_error); | ||||
| 		WARN_ON_ONCE(1); | ||||
| 		WARN_ON_ONCE(atomic_read(&n_rcu_torture_mberror)); | ||||
| 		WARN_ON_ONCE(n_rcu_torture_barrier_error);  // rcu_barrier()
 | ||||
| 		WARN_ON_ONCE(n_rcu_torture_boost_ktrerror); // no boost kthread
 | ||||
| 		WARN_ON_ONCE(n_rcu_torture_boost_rterror); // can't set RT prio
 | ||||
| 		WARN_ON_ONCE(n_rcu_torture_boost_failure); // RCU boost failed
 | ||||
| 		WARN_ON_ONCE(i > 1); // Too-short grace period
 | ||||
| 	} | ||||
| 	pr_cont("Reader Pipe: "); | ||||
| 	for (i = 0; i < RCU_TORTURE_PIPE_LEN + 1; i++) | ||||
|  | @ -1729,10 +1734,10 @@ static void rcu_torture_fwd_prog_cond_resched(unsigned long iter) | |||
| 		// Real call_rcu() floods hit userspace, so emulate that.
 | ||||
| 		if (need_resched() || (iter & 0xfff)) | ||||
| 			schedule(); | ||||
| 	} else { | ||||
| 		// No userspace emulation: CB invocation throttles call_rcu()
 | ||||
| 		cond_resched(); | ||||
| 		return; | ||||
| 	} | ||||
| 	// No userspace emulation: CB invocation throttles call_rcu()
 | ||||
| 	cond_resched(); | ||||
| } | ||||
| 
 | ||||
| /*
 | ||||
|  | @ -1759,6 +1764,11 @@ static unsigned long rcu_torture_fwd_prog_cbfree(void) | |||
| 		kfree(rfcp); | ||||
| 		freed++; | ||||
| 		rcu_torture_fwd_prog_cond_resched(freed); | ||||
| 		if (tick_nohz_full_enabled()) { | ||||
| 			local_irq_save(flags); | ||||
| 			rcu_momentary_dyntick_idle(); | ||||
| 			local_irq_restore(flags); | ||||
| 		} | ||||
| 	} | ||||
| 	return freed; | ||||
| } | ||||
|  | @ -1803,7 +1813,7 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries) | |||
| 		udelay(10); | ||||
| 		cur_ops->readunlock(idx); | ||||
| 		if (!fwd_progress_need_resched || need_resched()) | ||||
| 			rcu_torture_fwd_prog_cond_resched(1); | ||||
| 			cond_resched(); | ||||
| 	} | ||||
| 	(*tested_tries)++; | ||||
| 	if (!time_before(jiffies, stopat) && | ||||
|  | @ -1833,6 +1843,7 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries) | |||
| static void rcu_torture_fwd_prog_cr(void) | ||||
| { | ||||
| 	unsigned long cver; | ||||
| 	unsigned long flags; | ||||
| 	unsigned long gps; | ||||
| 	int i; | ||||
| 	long n_launders; | ||||
|  | @ -1865,6 +1876,7 @@ static void rcu_torture_fwd_prog_cr(void) | |||
| 	cver = READ_ONCE(rcu_torture_current_version); | ||||
| 	gps = cur_ops->get_gp_seq(); | ||||
| 	rcu_launder_gp_seq_start = gps; | ||||
| 	tick_dep_set_task(current, TICK_DEP_BIT_RCU); | ||||
| 	while (time_before(jiffies, stopat) && | ||||
| 	       !shutdown_time_arrived() && | ||||
| 	       !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) { | ||||
|  | @ -1891,6 +1903,11 @@ static void rcu_torture_fwd_prog_cr(void) | |||
| 		} | ||||
| 		cur_ops->call(&rfcp->rh, rcu_torture_fwd_cb_cr); | ||||
| 		rcu_torture_fwd_prog_cond_resched(n_launders + n_max_cbs); | ||||
| 		if (tick_nohz_full_enabled()) { | ||||
| 			local_irq_save(flags); | ||||
| 			rcu_momentary_dyntick_idle(); | ||||
| 			local_irq_restore(flags); | ||||
| 		} | ||||
| 	} | ||||
| 	stoppedat = jiffies; | ||||
| 	n_launders_cb_snap = READ_ONCE(n_launders_cb); | ||||
|  | @ -1911,6 +1928,7 @@ static void rcu_torture_fwd_prog_cr(void) | |||
| 		rcu_torture_fwd_cb_hist(); | ||||
| 	} | ||||
| 	schedule_timeout_uninterruptible(HZ); /* Let CBs drain. */ | ||||
| 	tick_dep_clear_task(current, TICK_DEP_BIT_RCU); | ||||
| 	WRITE_ONCE(rcu_fwd_cb_nodelay, false); | ||||
| } | ||||
| 
 | ||||
|  |  | |||
|  | @ -364,7 +364,7 @@ bool rcu_eqs_special_set(int cpu) | |||
|  * | ||||
|  * The caller must have disabled interrupts and must not be idle. | ||||
|  */ | ||||
| static void __maybe_unused rcu_momentary_dyntick_idle(void) | ||||
| void rcu_momentary_dyntick_idle(void) | ||||
| { | ||||
| 	int special; | ||||
| 
 | ||||
|  | @ -375,6 +375,7 @@ static void __maybe_unused rcu_momentary_dyntick_idle(void) | |||
| 	WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR)); | ||||
| 	rcu_preempt_deferred_qs(current); | ||||
| } | ||||
| EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle); | ||||
| 
 | ||||
| /**
 | ||||
|  * rcu_is_cpu_rrupt_from_idle - see if interrupted from idle | ||||
|  | @ -496,7 +497,7 @@ module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next | |||
| module_param(rcu_kick_kthreads, bool, 0644); | ||||
| 
 | ||||
| static void force_qs_rnp(int (*f)(struct rcu_data *rdp)); | ||||
| static int rcu_pending(void); | ||||
| static int rcu_pending(int user); | ||||
| 
 | ||||
| /*
 | ||||
|  * Return the number of RCU GPs completed thus far for debug & stats. | ||||
|  | @ -824,6 +825,11 @@ static __always_inline void rcu_nmi_enter_common(bool irq) | |||
| 			rcu_cleanup_after_idle(); | ||||
| 
 | ||||
| 		incby = 1; | ||||
| 	} else if (tick_nohz_full_cpu(rdp->cpu) && | ||||
| 		   rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE && | ||||
| 		   READ_ONCE(rdp->rcu_urgent_qs) && !rdp->rcu_forced_tick) { | ||||
| 		rdp->rcu_forced_tick = true; | ||||
| 		tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU); | ||||
| 	} | ||||
| 	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="), | ||||
| 			  rdp->dynticks_nmi_nesting, | ||||
|  | @ -885,6 +891,21 @@ void rcu_irq_enter_irqson(void) | |||
| 	local_irq_restore(flags); | ||||
| } | ||||
| 
 | ||||
| /*
 | ||||
|  * If any sort of urgency was applied to the current CPU (for example, | ||||
|  * the scheduler-clock interrupt was enabled on a nohz_full CPU) in order | ||||
|  * to get to a quiescent state, disable it. | ||||
|  */ | ||||
| static void rcu_disable_urgency_upon_qs(struct rcu_data *rdp) | ||||
| { | ||||
| 	WRITE_ONCE(rdp->rcu_urgent_qs, false); | ||||
| 	WRITE_ONCE(rdp->rcu_need_heavy_qs, false); | ||||
| 	if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick) { | ||||
| 		tick_dep_clear_cpu(rdp->cpu, TICK_DEP_BIT_RCU); | ||||
| 		rdp->rcu_forced_tick = false; | ||||
| 	} | ||||
| } | ||||
| 
 | ||||
| /**
 | ||||
|  * rcu_is_watching - see if RCU thinks that the current CPU is not idle | ||||
|  * | ||||
|  | @ -1073,6 +1094,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) | |||
| 	if (tick_nohz_full_cpu(rdp->cpu) && | ||||
| 		   time_after(jiffies, | ||||
| 			      READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) { | ||||
| 		WRITE_ONCE(*ruqp, true); | ||||
| 		resched_cpu(rdp->cpu); | ||||
| 		WRITE_ONCE(rdp->last_fqs_resched, jiffies); | ||||
| 	} | ||||
|  | @ -1968,7 +1990,6 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp) | |||
| 		return; | ||||
| 	} | ||||
| 	mask = rdp->grpmask; | ||||
| 	rdp->core_needs_qs = false; | ||||
| 	if ((rnp->qsmask & mask) == 0) { | ||||
| 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags); | ||||
| 	} else { | ||||
|  | @ -1979,6 +2000,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp) | |||
| 		if (!offloaded) | ||||
| 			needwake = rcu_accelerate_cbs(rnp, rdp); | ||||
| 
 | ||||
| 		rcu_disable_urgency_upon_qs(rdp); | ||||
| 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags); | ||||
| 		/* ^^^ Released rnp->lock */ | ||||
| 		if (needwake) | ||||
|  | @ -2101,6 +2123,9 @@ int rcutree_dead_cpu(unsigned int cpu) | |||
| 	rcu_boost_kthread_setaffinity(rnp, -1); | ||||
| 	/* Do any needed no-CB deferred wakeups from this CPU. */ | ||||
| 	do_nocb_deferred_wakeup(per_cpu_ptr(&rcu_data, cpu)); | ||||
| 
 | ||||
| 	// Stop-machine done, so allow nohz_full to disable tick.
 | ||||
| 	tick_dep_clear(TICK_DEP_BIT_RCU); | ||||
| 	return 0; | ||||
| } | ||||
| 
 | ||||
|  | @ -2151,6 +2176,7 @@ static void rcu_do_batch(struct rcu_data *rdp) | |||
| 	rcu_nocb_unlock_irqrestore(rdp, flags); | ||||
| 
 | ||||
| 	/* Invoke callbacks. */ | ||||
| 	tick_dep_set_task(current, TICK_DEP_BIT_RCU); | ||||
| 	rhp = rcu_cblist_dequeue(&rcl); | ||||
| 	for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) { | ||||
| 		debug_rcu_head_unqueue(rhp); | ||||
|  | @ -2217,6 +2243,7 @@ static void rcu_do_batch(struct rcu_data *rdp) | |||
| 	/* Re-invoke RCU core processing if there are callbacks remaining. */ | ||||
| 	if (!offloaded && rcu_segcblist_ready_cbs(&rdp->cblist)) | ||||
| 		invoke_rcu_core(); | ||||
| 	tick_dep_clear_task(current, TICK_DEP_BIT_RCU); | ||||
| } | ||||
| 
 | ||||
| /*
 | ||||
|  | @ -2241,7 +2268,7 @@ void rcu_sched_clock_irq(int user) | |||
| 		__this_cpu_write(rcu_data.rcu_urgent_qs, false); | ||||
| 	} | ||||
| 	rcu_flavor_sched_clock_irq(user); | ||||
| 	if (rcu_pending()) | ||||
| 	if (rcu_pending(user)) | ||||
| 		invoke_rcu_core(); | ||||
| 
 | ||||
| 	trace_rcu_utilization(TPS("End scheduler-tick")); | ||||
|  | @ -2259,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) | |||
| 	int cpu; | ||||
| 	unsigned long flags; | ||||
| 	unsigned long mask; | ||||
| 	struct rcu_data *rdp; | ||||
| 	struct rcu_node *rnp; | ||||
| 
 | ||||
| 	rcu_for_each_leaf_node(rnp) { | ||||
|  | @ -2283,8 +2311,11 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) | |||
| 		for_each_leaf_node_possible_cpu(rnp, cpu) { | ||||
| 			unsigned long bit = leaf_node_cpu_bit(rnp, cpu); | ||||
| 			if ((rnp->qsmask & bit) != 0) { | ||||
| 				if (f(per_cpu_ptr(&rcu_data, cpu))) | ||||
| 				rdp = per_cpu_ptr(&rcu_data, cpu); | ||||
| 				if (f(rdp)) { | ||||
| 					mask |= bit; | ||||
| 					rcu_disable_urgency_upon_qs(rdp); | ||||
| 				} | ||||
| 			} | ||||
| 		} | ||||
| 		if (mask != 0) { | ||||
|  | @ -2312,7 +2343,7 @@ void rcu_force_quiescent_state(void) | |||
| 	rnp = __this_cpu_read(rcu_data.mynode); | ||||
| 	for (; rnp != NULL; rnp = rnp->parent) { | ||||
| 		ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) || | ||||
| 		      !raw_spin_trylock(&rnp->fqslock); | ||||
| 		       !raw_spin_trylock(&rnp->fqslock); | ||||
| 		if (rnp_old != NULL) | ||||
| 			raw_spin_unlock(&rnp_old->fqslock); | ||||
| 		if (ret) | ||||
|  | @ -2786,8 +2817,9 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu); | |||
|  * CPU-local state are performed first.  However, we must check for CPU | ||||
|  * stalls first, else we might not get a chance. | ||||
|  */ | ||||
| static int rcu_pending(void) | ||||
| static int rcu_pending(int user) | ||||
| { | ||||
| 	bool gp_in_progress; | ||||
| 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data); | ||||
| 	struct rcu_node *rnp = rdp->mynode; | ||||
| 
 | ||||
|  | @ -2798,12 +2830,13 @@ static int rcu_pending(void) | |||
| 	if (rcu_nocb_need_deferred_wakeup(rdp)) | ||||
| 		return 1; | ||||
| 
 | ||||
| 	/* Is this CPU a NO_HZ_FULL CPU that should ignore RCU? */ | ||||
| 	if (rcu_nohz_full_cpu()) | ||||
| 	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */ | ||||
| 	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu()) | ||||
| 		return 0; | ||||
| 
 | ||||
| 	/* Is the RCU core waiting for a quiescent state from this CPU? */ | ||||
| 	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm) | ||||
| 	gp_in_progress = rcu_gp_in_progress(); | ||||
| 	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress) | ||||
| 		return 1; | ||||
| 
 | ||||
| 	/* Does this CPU have callbacks ready to invoke? */ | ||||
|  | @ -2811,8 +2844,7 @@ static int rcu_pending(void) | |||
| 		return 1; | ||||
| 
 | ||||
| 	/* Has RCU gone idle with this CPU needing another grace period? */ | ||||
| 	if (!rcu_gp_in_progress() && | ||||
| 	    rcu_segcblist_is_enabled(&rdp->cblist) && | ||||
| 	if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) && | ||||
| 	    (!IS_ENABLED(CONFIG_RCU_NOCB_CPU) || | ||||
| 	     !rcu_segcblist_is_offloaded(&rdp->cblist)) && | ||||
| 	    !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) | ||||
|  | @ -2845,7 +2877,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp) | |||
| { | ||||
| 	if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) { | ||||
| 		rcu_barrier_trace(TPS("LastCB"), -1, | ||||
| 				   rcu_state.barrier_sequence); | ||||
| 				  rcu_state.barrier_sequence); | ||||
| 		complete(&rcu_state.barrier_completion); | ||||
| 	} else { | ||||
| 		rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence); | ||||
|  | @ -2869,7 +2901,7 @@ static void rcu_barrier_func(void *unused) | |||
| 	} else { | ||||
| 		debug_rcu_head_unqueue(&rdp->barrier_head); | ||||
| 		rcu_barrier_trace(TPS("IRQNQ"), -1, | ||||
| 				   rcu_state.barrier_sequence); | ||||
| 				  rcu_state.barrier_sequence); | ||||
| 	} | ||||
| 	rcu_nocb_unlock(rdp); | ||||
| } | ||||
|  | @ -2896,7 +2928,7 @@ void rcu_barrier(void) | |||
| 	/* Did someone else do our work for us? */ | ||||
| 	if (rcu_seq_done(&rcu_state.barrier_sequence, s)) { | ||||
| 		rcu_barrier_trace(TPS("EarlyExit"), -1, | ||||
| 				   rcu_state.barrier_sequence); | ||||
| 				  rcu_state.barrier_sequence); | ||||
| 		smp_mb(); /* caller's subsequent code after above check. */ | ||||
| 		mutex_unlock(&rcu_state.barrier_mutex); | ||||
| 		return; | ||||
|  | @ -2928,11 +2960,11 @@ void rcu_barrier(void) | |||
| 			continue; | ||||
| 		if (rcu_segcblist_n_cbs(&rdp->cblist)) { | ||||
| 			rcu_barrier_trace(TPS("OnlineQ"), cpu, | ||||
| 					   rcu_state.barrier_sequence); | ||||
| 					  rcu_state.barrier_sequence); | ||||
| 			smp_call_function_single(cpu, rcu_barrier_func, NULL, 1); | ||||
| 		} else { | ||||
| 			rcu_barrier_trace(TPS("OnlineNQ"), cpu, | ||||
| 					   rcu_state.barrier_sequence); | ||||
| 					  rcu_state.barrier_sequence); | ||||
| 		} | ||||
| 	} | ||||
| 	put_online_cpus(); | ||||
|  | @ -3083,6 +3115,9 @@ int rcutree_online_cpu(unsigned int cpu) | |||
| 		return 0; /* Too early in boot for scheduler work. */ | ||||
| 	sync_sched_exp_online_cleanup(cpu); | ||||
| 	rcutree_affinity_setting(cpu, -1); | ||||
| 
 | ||||
| 	// Stop-machine done, so allow nohz_full to disable tick.
 | ||||
| 	tick_dep_clear(TICK_DEP_BIT_RCU); | ||||
| 	return 0; | ||||
| } | ||||
| 
 | ||||
|  | @ -3103,6 +3138,9 @@ int rcutree_offline_cpu(unsigned int cpu) | |||
| 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags); | ||||
| 
 | ||||
| 	rcutree_affinity_setting(cpu, cpu); | ||||
| 
 | ||||
| 	// nohz_full CPUs need the tick for stop-machine to work quickly
 | ||||
| 	tick_dep_set(TICK_DEP_BIT_RCU); | ||||
| 	return 0; | ||||
| } | ||||
| 
 | ||||
|  | @ -3148,6 +3186,7 @@ void rcu_cpu_starting(unsigned int cpu) | |||
| 	rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq); | ||||
| 	rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags); | ||||
| 	if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */ | ||||
| 		rcu_disable_urgency_upon_qs(rdp); | ||||
| 		/* Report QS -after- changing ->qsmaskinitnext! */ | ||||
| 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags); | ||||
| 	} else { | ||||
|  |  | |||
|  | @ -181,6 +181,7 @@ struct rcu_data { | |||
| 	atomic_t dynticks;		/* Even value for idle, else odd. */ | ||||
| 	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */ | ||||
| 	bool rcu_urgent_qs;		/* GP old need light quiescent state. */ | ||||
| 	bool rcu_forced_tick;		/* Forced tick to provide QS. */ | ||||
| #ifdef CONFIG_RCU_FAST_NO_HZ | ||||
| 	bool all_lazy;			/* All CPU's CBs lazy at idle start? */ | ||||
| 	unsigned long last_accelerate;	/* Last jiffy CBs were accelerated. */ | ||||
|  |  | |||
|  | @ -1946,7 +1946,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) | |||
| 	int __maybe_unused cpu = my_rdp->cpu; | ||||
| 	unsigned long cur_gp_seq; | ||||
| 	unsigned long flags; | ||||
| 	bool gotcbs; | ||||
| 	bool gotcbs = false; | ||||
| 	unsigned long j = jiffies; | ||||
| 	bool needwait_gp = false; // This prevents actual uninitialized use.
 | ||||
| 	bool needwake; | ||||
|  |  | |||
|  | @ -235,6 +235,7 @@ static int multi_cpu_stop(void *data) | |||
| 			 */ | ||||
| 			touch_nmi_watchdog(); | ||||
| 		} | ||||
| 		rcu_momentary_dyntick_idle(); | ||||
| 	} while (curstate != MULTI_STOP_EXIT); | ||||
| 
 | ||||
| 	local_irq_restore(flags); | ||||
|  |  | |||
|  | @ -172,6 +172,7 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs) | |||
| #ifdef CONFIG_NO_HZ_FULL | ||||
| cpumask_var_t tick_nohz_full_mask; | ||||
| bool tick_nohz_full_running; | ||||
| EXPORT_SYMBOL_GPL(tick_nohz_full_running); | ||||
| static atomic_t tick_dep_mask; | ||||
| 
 | ||||
| static bool check_tick_dependency(atomic_t *dep) | ||||
|  | @ -198,6 +199,11 @@ static bool check_tick_dependency(atomic_t *dep) | |||
| 		return true; | ||||
| 	} | ||||
| 
 | ||||
| 	if (val & TICK_DEP_MASK_RCU) { | ||||
| 		trace_tick_stop(0, TICK_DEP_MASK_RCU); | ||||
| 		return true; | ||||
| 	} | ||||
| 
 | ||||
| 	return false; | ||||
| } | ||||
| 
 | ||||
|  | @ -324,6 +330,7 @@ void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) | |||
| 		preempt_enable(); | ||||
| 	} | ||||
| } | ||||
| EXPORT_SYMBOL_GPL(tick_nohz_dep_set_cpu); | ||||
| 
 | ||||
| void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) | ||||
| { | ||||
|  | @ -331,6 +338,7 @@ void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) | |||
| 
 | ||||
| 	atomic_andnot(BIT(bit), &ts->tick_dep_mask); | ||||
| } | ||||
| EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cpu); | ||||
| 
 | ||||
| /*
 | ||||
|  * Set a per-task tick dependency. Posix CPU timers need this in order to elapse | ||||
|  | @ -344,11 +352,13 @@ void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit) | |||
| 	 */ | ||||
| 	tick_nohz_dep_set_all(&tsk->tick_dep_mask, bit); | ||||
| } | ||||
| EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task); | ||||
| 
 | ||||
| void tick_nohz_dep_clear_task(struct task_struct *tsk, enum tick_dep_bits bit) | ||||
| { | ||||
| 	atomic_andnot(BIT(bit), &tsk->tick_dep_mask); | ||||
| } | ||||
| EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_task); | ||||
| 
 | ||||
| /*
 | ||||
|  * Set a per-taskgroup tick dependency. Posix CPU timers need this in order to elapse | ||||
|  | @ -397,6 +407,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) | |||
| 	cpumask_copy(tick_nohz_full_mask, cpumask); | ||||
| 	tick_nohz_full_running = true; | ||||
| } | ||||
| EXPORT_SYMBOL_GPL(tick_nohz_full_setup); | ||||
| 
 | ||||
| static int tick_nohz_cpu_down(unsigned int cpu) | ||||
| { | ||||
|  |  | |||
|  | @ -365,11 +365,6 @@ static void show_pwq(struct pool_workqueue *pwq); | |||
| 			 !lockdep_is_held(&wq_pool_mutex),		\ | ||||
| 			 "RCU or wq_pool_mutex should be held") | ||||
| 
 | ||||
| #define assert_rcu_or_wq_mutex(wq)					\ | ||||
| 	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\ | ||||
| 			 !lockdep_is_held(&wq->mutex),			\ | ||||
| 			 "RCU or wq->mutex should be held") | ||||
| 
 | ||||
| #define assert_rcu_or_wq_mutex_or_pool_mutex(wq)			\ | ||||
| 	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\ | ||||
| 			 !lockdep_is_held(&wq->mutex) &&		\ | ||||
|  | @ -427,9 +422,7 @@ static void show_pwq(struct pool_workqueue *pwq); | |||
|  */ | ||||
| #define for_each_pwq(pwq, wq)						\ | ||||
| 	list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node,		\ | ||||
| 				lockdep_is_held(&wq->mutex))		\ | ||||
| 		if (({ assert_rcu_or_wq_mutex(wq); false; })) { }	\ | ||||
| 		else | ||||
| 				 lockdep_is_held(&(wq->mutex))) | ||||
| 
 | ||||
| #ifdef CONFIG_DEBUG_OBJECTS_WORK | ||||
| 
 | ||||
|  |  | |||
|  | @ -1314,8 +1314,8 @@ int dev_set_alias(struct net_device *dev, const char *alias, size_t len) | |||
| 	} | ||||
| 
 | ||||
| 	mutex_lock(&ifalias_mutex); | ||||
| 	rcu_swap_protected(dev->ifalias, new_alias, | ||||
| 			   mutex_is_locked(&ifalias_mutex)); | ||||
| 	new_alias = rcu_replace_pointer(dev->ifalias, new_alias, | ||||
| 					mutex_is_locked(&ifalias_mutex)); | ||||
| 	mutex_unlock(&ifalias_mutex); | ||||
| 
 | ||||
| 	if (new_alias) | ||||
|  |  | |||
|  | @ -356,8 +356,8 @@ int reuseport_detach_prog(struct sock *sk) | |||
| 	spin_lock_bh(&reuseport_lock); | ||||
| 	reuse = rcu_dereference_protected(sk->sk_reuseport_cb, | ||||
| 					  lockdep_is_held(&reuseport_lock)); | ||||
| 	rcu_swap_protected(reuse->prog, old_prog, | ||||
| 			   lockdep_is_held(&reuseport_lock)); | ||||
| 	old_prog = rcu_replace_pointer(reuse->prog, old_prog, | ||||
| 				       lockdep_is_held(&reuseport_lock)); | ||||
| 	spin_unlock_bh(&reuseport_lock); | ||||
| 
 | ||||
| 	if (!old_prog) | ||||
|  |  | |||
|  | @ -1557,8 +1557,9 @@ static void nft_chain_stats_replace(struct nft_trans *trans) | |||
| 	if (!nft_trans_chain_stats(trans)) | ||||
| 		return; | ||||
| 
 | ||||
| 	rcu_swap_protected(chain->stats, nft_trans_chain_stats(trans), | ||||
| 			   lockdep_commit_lock_is_held(trans->ctx.net)); | ||||
| 	nft_trans_chain_stats(trans) = | ||||
| 		rcu_replace_pointer(chain->stats, nft_trans_chain_stats(trans), | ||||
| 				    lockdep_commit_lock_is_held(trans->ctx.net)); | ||||
| 
 | ||||
| 	if (!nft_trans_chain_stats(trans)) | ||||
| 		static_branch_inc(&nft_counters_enabled); | ||||
|  |  | |||
|  | @ -88,7 +88,7 @@ struct tcf_chain *tcf_action_set_ctrlact(struct tc_action *a, int action, | |||
| 					 struct tcf_chain *goto_chain) | ||||
| { | ||||
| 	a->tcfa_action = action; | ||||
| 	rcu_swap_protected(a->goto_chain, goto_chain, 1); | ||||
| 	goto_chain = rcu_replace_pointer(a->goto_chain, goto_chain, 1); | ||||
| 	return goto_chain; | ||||
| } | ||||
| EXPORT_SYMBOL(tcf_action_set_ctrlact); | ||||
|  |  | |||
|  | @ -101,8 +101,8 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla, | |||
| 
 | ||||
| 	spin_lock_bh(&p->tcf_lock); | ||||
| 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); | ||||
| 	rcu_swap_protected(p->params, params_new, | ||||
| 			   lockdep_is_held(&p->tcf_lock)); | ||||
| 	params_new = rcu_replace_pointer(p->params, params_new, | ||||
| 					 lockdep_is_held(&p->tcf_lock)); | ||||
| 	spin_unlock_bh(&p->tcf_lock); | ||||
| 
 | ||||
| 	if (goto_ch) | ||||
|  |  | |||
|  | @ -721,7 +721,8 @@ static int tcf_ct_init(struct net *net, struct nlattr *nla, | |||
| 
 | ||||
| 	spin_lock_bh(&c->tcf_lock); | ||||
| 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); | ||||
| 	rcu_swap_protected(c->params, params, lockdep_is_held(&c->tcf_lock)); | ||||
| 	params = rcu_replace_pointer(c->params, params, | ||||
| 				     lockdep_is_held(&c->tcf_lock)); | ||||
| 	spin_unlock_bh(&c->tcf_lock); | ||||
| 
 | ||||
| 	if (goto_ch) | ||||
|  |  | |||
|  | @ -257,8 +257,8 @@ static int tcf_ctinfo_init(struct net *net, struct nlattr *nla, | |||
| 
 | ||||
| 	spin_lock_bh(&ci->tcf_lock); | ||||
| 	goto_ch = tcf_action_set_ctrlact(*a, actparm->action, goto_ch); | ||||
| 	rcu_swap_protected(ci->params, cp_new, | ||||
| 			   lockdep_is_held(&ci->tcf_lock)); | ||||
| 	cp_new = rcu_replace_pointer(ci->params, cp_new, | ||||
| 				     lockdep_is_held(&ci->tcf_lock)); | ||||
| 	spin_unlock_bh(&ci->tcf_lock); | ||||
| 
 | ||||
| 	if (goto_ch) | ||||
|  |  | |||
|  | @ -595,7 +595,7 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, | |||
| 		spin_lock_bh(&ife->tcf_lock); | ||||
| 	/* protected by tcf_lock when modifying existing action */ | ||||
| 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); | ||||
| 	rcu_swap_protected(ife->params, p, 1); | ||||
| 	p = rcu_replace_pointer(ife->params, p, 1); | ||||
| 
 | ||||
| 	if (exists) | ||||
| 		spin_unlock_bh(&ife->tcf_lock); | ||||
|  |  | |||
|  | @ -178,8 +178,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, | |||
| 			goto put_chain; | ||||
| 		} | ||||
| 		mac_header_xmit = dev_is_mac_header_xmit(dev); | ||||
| 		rcu_swap_protected(m->tcfm_dev, dev, | ||||
| 				   lockdep_is_held(&m->tcf_lock)); | ||||
| 		dev = rcu_replace_pointer(m->tcfm_dev, dev, | ||||
| 					  lockdep_is_held(&m->tcf_lock)); | ||||
| 		if (dev) | ||||
| 			dev_put(dev); | ||||
| 		m->tcfm_mac_header_xmit = mac_header_xmit; | ||||
|  |  | |||
|  | @ -262,7 +262,7 @@ static int tcf_mpls_init(struct net *net, struct nlattr *nla, | |||
| 
 | ||||
| 	spin_lock_bh(&m->tcf_lock); | ||||
| 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); | ||||
| 	rcu_swap_protected(m->mpls_p, p, lockdep_is_held(&m->tcf_lock)); | ||||
| 	p = rcu_replace_pointer(m->mpls_p, p, lockdep_is_held(&m->tcf_lock)); | ||||
| 	spin_unlock_bh(&m->tcf_lock); | ||||
| 
 | ||||
| 	if (goto_ch) | ||||
|  |  | |||
|  | @ -191,9 +191,9 @@ static int tcf_police_init(struct net *net, struct nlattr *nla, | |||
| 		police->tcfp_ptoks = new->tcfp_mtu_ptoks; | ||||
| 	spin_unlock_bh(&police->tcfp_lock); | ||||
| 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); | ||||
| 	rcu_swap_protected(police->params, | ||||
| 			   new, | ||||
| 			   lockdep_is_held(&police->tcf_lock)); | ||||
| 	new = rcu_replace_pointer(police->params, | ||||
| 				  new, | ||||
| 				  lockdep_is_held(&police->tcf_lock)); | ||||
| 	spin_unlock_bh(&police->tcf_lock); | ||||
| 
 | ||||
| 	if (goto_ch) | ||||
|  |  | |||
|  | @ -102,8 +102,8 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla, | |||
| 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); | ||||
| 	s->rate = rate; | ||||
| 	s->psample_group_num = psample_group_num; | ||||
| 	rcu_swap_protected(s->psample_group, psample_group, | ||||
| 			   lockdep_is_held(&s->tcf_lock)); | ||||
| 	psample_group = rcu_replace_pointer(s->psample_group, psample_group, | ||||
| 					    lockdep_is_held(&s->tcf_lock)); | ||||
| 
 | ||||
| 	if (tb[TCA_SAMPLE_TRUNC_SIZE]) { | ||||
| 		s->truncate = true; | ||||
|  |  | |||
|  | @ -206,8 +206,8 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, | |||
| 
 | ||||
| 	spin_lock_bh(&d->tcf_lock); | ||||
| 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); | ||||
| 	rcu_swap_protected(d->params, params_new, | ||||
| 			   lockdep_is_held(&d->tcf_lock)); | ||||
| 	params_new = rcu_replace_pointer(d->params, params_new, | ||||
| 					 lockdep_is_held(&d->tcf_lock)); | ||||
| 	spin_unlock_bh(&d->tcf_lock); | ||||
| 	if (params_new) | ||||
| 		kfree_rcu(params_new, rcu); | ||||
|  |  | |||
|  | @ -529,8 +529,8 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla, | |||
| 
 | ||||
| 	spin_lock_bh(&t->tcf_lock); | ||||
| 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); | ||||
| 	rcu_swap_protected(t->params, params_new, | ||||
| 			   lockdep_is_held(&t->tcf_lock)); | ||||
| 	params_new = rcu_replace_pointer(t->params, params_new, | ||||
| 					 lockdep_is_held(&t->tcf_lock)); | ||||
| 	spin_unlock_bh(&t->tcf_lock); | ||||
| 	tunnel_key_release_params(params_new); | ||||
| 	if (goto_ch) | ||||
|  |  | |||
|  | @ -221,7 +221,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla, | |||
| 
 | ||||
| 	spin_lock_bh(&v->tcf_lock); | ||||
| 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); | ||||
| 	rcu_swap_protected(v->vlan_p, p, lockdep_is_held(&v->tcf_lock)); | ||||
| 	p = rcu_replace_pointer(v->vlan_p, p, lockdep_is_held(&v->tcf_lock)); | ||||
| 	spin_unlock_bh(&v->tcf_lock); | ||||
| 
 | ||||
| 	if (goto_ch) | ||||
|  |  | |||
|  | @ -179,8 +179,8 @@ out_free_rule: | |||
| 	 * doesn't currently exist, just use a spinlock for now. | ||||
| 	 */ | ||||
| 	mutex_lock(&policy_update_lock); | ||||
| 	rcu_swap_protected(safesetid_setuid_rules, pol, | ||||
| 			   lockdep_is_held(&policy_update_lock)); | ||||
| 	pol = rcu_replace_pointer(safesetid_setuid_rules, pol, | ||||
| 				  lockdep_is_held(&policy_update_lock)); | ||||
| 	mutex_unlock(&policy_update_lock); | ||||
| 	err = len; | ||||
| 
 | ||||
|  |  | |||
|  | @ -27,9 +27,10 @@ Explanation of the Linux-Kernel Memory Consistency Model | |||
|   19. AND THEN THERE WAS ALPHA | ||||
|   20. THE HAPPENS-BEFORE RELATION: hb | ||||
|   21. THE PROPAGATES-BEFORE RELATION: pb | ||||
|   22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-fence, and rb | ||||
|   22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb | ||||
|   23. LOCKING | ||||
|   24. ODDS AND ENDS | ||||
|   24. PLAIN ACCESSES AND DATA RACES | ||||
|   25. ODDS AND ENDS | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
|  | @ -42,8 +43,7 @@ linux-kernel.bell and linux-kernel.cat files that make up the formal | |||
| version of the model; they are extremely terse and their meanings are | ||||
| far from clear. | ||||
| 
 | ||||
| This document describes the ideas underlying the LKMM, but excluding | ||||
| the modeling of bare C (or plain) shared memory accesses.  It is meant | ||||
| This document describes the ideas underlying the LKMM.  It is meant | ||||
| for people who want to understand how the model was designed.  It does | ||||
| not go into the details of the code in the .bell and .cat files; | ||||
| rather, it explains in English what the code expresses symbolically. | ||||
|  | @ -206,7 +206,7 @@ goes like this: | |||
| 	P0 stores 1 to buf before storing 1 to flag, since it executes | ||||
| 	its instructions in order. | ||||
| 
 | ||||
| 	Since an instruction (in this case, P1's store to flag) cannot | ||||
| 	Since an instruction (in this case, P0's store to flag) cannot | ||||
| 	execute before itself, the specified outcome is impossible. | ||||
| 
 | ||||
| However, real computer hardware almost never follows the Sequential | ||||
|  | @ -419,7 +419,7 @@ example: | |||
| 
 | ||||
| The object code might call f(5) either before or after g(6); the | ||||
| memory model cannot assume there is a fixed program order relation | ||||
| between them.  (In fact, if the functions are inlined then the | ||||
| between them.  (In fact, if the function calls are inlined then the | ||||
| compiler might even interleave their object code.) | ||||
| 
 | ||||
| 
 | ||||
|  | @ -499,7 +499,7 @@ different CPUs (external reads-from, or rfe). | |||
| 
 | ||||
| For our purposes, a memory location's initial value is treated as | ||||
| though it had been written there by an imaginary initial store that | ||||
| executes on a separate CPU before the program runs. | ||||
| executes on a separate CPU before the main program runs. | ||||
| 
 | ||||
| Usage of the rf relation implicitly assumes that loads will always | ||||
| read from a single store.  It doesn't apply properly in the presence | ||||
|  | @ -857,7 +857,7 @@ outlined above.  These restrictions involve the necessity of | |||
| maintaining cache coherence and the fact that a CPU can't operate on a | ||||
| value before it knows what that value is, among other things. | ||||
| 
 | ||||
| The formal version of the LKMM is defined by five requirements, or | ||||
| The formal version of the LKMM is defined by six requirements, or | ||||
| axioms: | ||||
| 
 | ||||
| 	Sequential consistency per variable: This requires that the | ||||
|  | @ -877,10 +877,14 @@ axioms: | |||
| 	grace periods obey the rules of RCU, in particular, the | ||||
| 	Grace-Period Guarantee. | ||||
| 
 | ||||
| 	Plain-coherence: This requires that plain memory accesses | ||||
| 	(those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey | ||||
| 	the operational model's rules regarding cache coherence. | ||||
| 
 | ||||
| The first and second are quite common; they can be found in many | ||||
| memory models (such as those for C11/C++11).  The "happens-before" and | ||||
| "propagation" axioms have analogs in other memory models as well.  The | ||||
| "rcu" axiom is specific to the LKMM. | ||||
| "rcu" and "plain-coherence" axioms are specific to the LKMM. | ||||
| 
 | ||||
| Each of these axioms is discussed below. | ||||
| 
 | ||||
|  | @ -955,7 +959,7 @@ atomic update.  This is what the LKMM's "atomic" axiom says. | |||
| THE PRESERVED PROGRAM ORDER RELATION: ppo | ||||
| ----------------------------------------- | ||||
| 
 | ||||
| There are many situations where a CPU is obligated to execute two | ||||
| There are many situations where a CPU is obliged to execute two | ||||
| instructions in program order.  We amalgamate them into the ppo (for | ||||
| "preserved program order") relation, which links the po-earlier | ||||
| instruction to the po-later instruction and is thus a sub-relation of | ||||
|  | @ -1425,8 +1429,8 @@ they execute means that it cannot have cycles.  This requirement is | |||
| the content of the LKMM's "propagation" axiom. | ||||
| 
 | ||||
| 
 | ||||
| RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-fence, and rb | ||||
| ------------------------------------------------------------- | ||||
| RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb | ||||
| ------------------------------------------------------------------------ | ||||
| 
 | ||||
| RCU (Read-Copy-Update) is a powerful synchronization mechanism.  It | ||||
| rests on two concepts: grace periods and read-side critical sections. | ||||
|  | @ -1536,29 +1540,29 @@ Z's CPU before Z begins but doesn't propagate to some other CPU until | |||
| after X ends.)  Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is | ||||
| the end of a critical section which starts before Z begins. | ||||
| 
 | ||||
| The LKMM goes on to define the rcu-fence relation as a sequence of | ||||
| The LKMM goes on to define the rcu-order relation as a sequence of | ||||
| rcu-gp and rcu-rscsi links separated by rcu-link links, in which the | ||||
| number of rcu-gp links is >= the number of rcu-rscsi links.  For | ||||
| example: | ||||
| 
 | ||||
| 	X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V | ||||
| 
 | ||||
| would imply that X ->rcu-fence V, because this sequence contains two | ||||
| would imply that X ->rcu-order V, because this sequence contains two | ||||
| rcu-gp links and one rcu-rscsi link.  (It also implies that | ||||
| X ->rcu-fence T and Z ->rcu-fence V.)  On the other hand: | ||||
| X ->rcu-order T and Z ->rcu-order V.)  On the other hand: | ||||
| 
 | ||||
| 	X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V | ||||
| 
 | ||||
| does not imply X ->rcu-fence V, because the sequence contains only | ||||
| does not imply X ->rcu-order V, because the sequence contains only | ||||
| one rcu-gp link but two rcu-rscsi links. | ||||
| 
 | ||||
| The rcu-fence relation is important because the Grace Period Guarantee | ||||
| means that rcu-fence acts kind of like a strong fence.  In particular, | ||||
| E ->rcu-fence F implies not only that E begins before F ends, but also | ||||
| that any write po-before E will propagate to every CPU before any | ||||
| instruction po-after F can execute.  (However, it does not imply that | ||||
| E must execute before F; in fact, each synchronize_rcu() fence event | ||||
| is linked to itself by rcu-fence as a degenerate case.) | ||||
| The rcu-order relation is important because the Grace Period Guarantee | ||||
| means that rcu-order links act kind of like strong fences.  In | ||||
| particular, E ->rcu-order F implies not only that E begins before F | ||||
| ends, but also that any write po-before E will propagate to every CPU | ||||
| before any instruction po-after F can execute.  (However, it does not | ||||
| imply that E must execute before F; in fact, each synchronize_rcu() | ||||
| fence event is linked to itself by rcu-order as a degenerate case.) | ||||
| 
 | ||||
| To prove this in full generality requires some intellectual effort. | ||||
| We'll consider just a very simple case: | ||||
|  | @ -1572,7 +1576,7 @@ and there are events X, Y and a read-side critical section C such that: | |||
| 
 | ||||
| 	2. X comes "before" Y in some sense (including rfe, co and fr); | ||||
| 
 | ||||
| 	2. Y is po-before Z; | ||||
| 	3. Y is po-before Z; | ||||
| 
 | ||||
| 	4. Z is the rcu_read_unlock() event marking the end of C; | ||||
| 
 | ||||
|  | @ -1585,7 +1589,26 @@ G's CPU before G starts must propagate to every CPU before C starts. | |||
| In particular, the write propagates to every CPU before F finishes | ||||
| executing and hence before any instruction po-after F can execute. | ||||
| This sort of reasoning can be extended to handle all the situations | ||||
| covered by rcu-fence. | ||||
| covered by rcu-order. | ||||
| 
 | ||||
| The rcu-fence relation is a simple extension of rcu-order.  While | ||||
| rcu-order only links certain fence events (calls to synchronize_rcu(), | ||||
| rcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events | ||||
| that are separated by an rcu-order link.  This is analogous to the way | ||||
| the strong-fence relation links events that are separated by an | ||||
| smp_mb() fence event (as mentioned above, rcu-order links act kind of | ||||
| like strong fences).  Written symbolically, X ->rcu-fence Y means | ||||
| there are fence events E and F such that: | ||||
| 
 | ||||
| 	X ->po E ->rcu-order F ->po Y. | ||||
| 
 | ||||
| From the discussion above, we see this implies not only that X | ||||
| executes before Y, but also (if X is a store) that X propagates to | ||||
| every CPU before Y executes.  Thus rcu-fence is sort of a | ||||
| "super-strong" fence: Unlike the original strong fences (smp_mb() and | ||||
| synchronize_rcu()), rcu-fence is able to link events on different | ||||
| CPUs.  (Perhaps this fact should lead us to say that rcu-fence isn't | ||||
| really a fence at all!) | ||||
| 
 | ||||
| Finally, the LKMM defines the RCU-before (rb) relation in terms of | ||||
| rcu-fence.  This is done in essentially the same way as the pb | ||||
|  | @ -1596,7 +1619,7 @@ before F, just as E ->pb F does (and for much the same reasons). | |||
| Putting this all together, the LKMM expresses the Grace Period | ||||
| Guarantee by requiring that the rb relation does not contain a cycle. | ||||
| Equivalently, this "rcu" axiom requires that there are no events E | ||||
| and F with E ->rcu-link F ->rcu-fence E.  Or to put it a third way, | ||||
| and F with E ->rcu-link F ->rcu-order E.  Or to put it a third way, | ||||
| the axiom requires that there are no cycles consisting of rcu-gp and | ||||
| rcu-rscsi alternating with rcu-link, where the number of rcu-gp links | ||||
| is >= the number of rcu-rscsi links. | ||||
|  | @ -1750,7 +1773,7 @@ addition to normal RCU.  The ideas involved are much the same as | |||
| above, with new relations srcu-gp and srcu-rscsi added to represent | ||||
| SRCU grace periods and read-side critical sections.  There is a | ||||
| restriction on the srcu-gp and srcu-rscsi links that can appear in an | ||||
| rcu-fence sequence (the srcu-rscsi links must be paired with srcu-gp | ||||
| rcu-order sequence (the srcu-rscsi links must be paired with srcu-gp | ||||
| links having the same SRCU domain with proper nesting); the details | ||||
| are relatively unimportant. | ||||
| 
 | ||||
|  | @ -1896,6 +1919,521 @@ architectures supported by the Linux kernel, albeit for various | |||
| differing reasons. | ||||
| 
 | ||||
| 
 | ||||
| PLAIN ACCESSES AND DATA RACES | ||||
| ----------------------------- | ||||
| 
 | ||||
| In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y), | ||||
| smp_load_acquire(&z), and so on are collectively referred to as | ||||
| "marked" accesses, because they are all annotated with special | ||||
| operations of one kind or another.  Ordinary C-language memory | ||||
| accesses such as x or y = 0 are simply called "plain" accesses. | ||||
| 
 | ||||
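For instance, here is a small illustrative sketch (x, y, r1 and r2 are
generic shared variables and local registers, not taken from any
particular kernel code):

	WRITE_ONCE(x, 1);	/* marked store */
	r1 = READ_ONCE(x);	/* marked load */
	y = 0;			/* plain store */
	r2 = y;			/* plain load */
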
| Early versions of the LKMM had nothing to say about plain accesses. | ||||
| The C standard allows compilers to assume that the variables affected | ||||
| by plain accesses are not concurrently read or written by any other | ||||
| threads or CPUs.  This leaves compilers free to implement all manner | ||||
| of transformations or optimizations of code containing plain accesses, | ||||
| making such code very difficult for a memory model to handle. | ||||
| 
 | ||||
| Here is just one example of a possible pitfall: | ||||
| 
 | ||||
| 	int a = 6; | ||||
| 	int *x = &a; | ||||
| 
 | ||||
| 	P0() | ||||
| 	{ | ||||
| 		int *r1; | ||||
| 		int r2 = 0; | ||||
| 
 | ||||
| 		r1 = x; | ||||
| 		if (r1 != NULL) | ||||
| 			r2 = READ_ONCE(*r1); | ||||
| 	} | ||||
| 
 | ||||
| 	P1() | ||||
| 	{ | ||||
| 		WRITE_ONCE(x, NULL); | ||||
| 	} | ||||
| 
 | ||||
| On the face of it, one would expect that when this code runs, the only | ||||
| possible final values for r2 are 6 and 0, depending on whether or not | ||||
| P1's store to x propagates to P0 before P0's load from x executes. | ||||
| But since P0's load from x is a plain access, the compiler may decide | ||||
| to carry out the load twice (for the comparison against NULL, then again | ||||
| for the READ_ONCE()) and eliminate the temporary variable r1.  The | ||||
| object code generated for P0 could therefore end up looking rather | ||||
| like this: | ||||
| 
 | ||||
| 	P0() | ||||
| 	{ | ||||
| 		int r2 = 0; | ||||
| 
 | ||||
| 		if (x != NULL) | ||||
| 			r2 = READ_ONCE(*x); | ||||
| 	} | ||||
| 
 | ||||
| And now it is obvious that this code runs the risk of dereferencing a | ||||
| NULL pointer, because P1's store to x might propagate to P0 after the | ||||
| test against NULL has been made but before the READ_ONCE() executes. | ||||
| If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x", | ||||
| the compiler would not have performed this optimization and there | ||||
| would be no possibility of a NULL-pointer dereference. | ||||
| 
 | ||||
| Given the possibility of transformations like this one, the LKMM | ||||
| doesn't try to predict all possible outcomes of code containing plain | ||||
| accesses.  It is instead content to determine whether the code | ||||
| violates the compiler's assumptions, which would render the ultimate | ||||
| outcome undefined. | ||||
| 
 | ||||
| In technical terms, the compiler is allowed to assume that when the | ||||
| program executes, there will not be any data races.  A "data race" | ||||
| occurs when two conflicting memory accesses execute concurrently; | ||||
| two memory accesses "conflict" if: | ||||
| 
 | ||||
| 	they access the same location, | ||||
| 
 | ||||
| 	they occur on different CPUs (or in different threads on the | ||||
| 	same CPU), | ||||
| 
 | ||||
| 	at least one of them is a plain access, | ||||
| 
 | ||||
| 	and at least one of them is a store. | ||||
| 
 | ||||
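As a minimal sketch of this definition (z, r1, P0 and P1 are purely
illustrative), the following two accesses conflict and nothing orders
them, so the LKMM would report a potential data race even though one
side is marked:

	int z;

	P0()
	{
		z = 1;			/* plain store */
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(z);	/* marked load; still races with P0 */
	}
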
| The LKMM tries to determine whether a program contains two conflicting | ||||
| accesses which may execute concurrently; if it does then the LKMM says | ||||
| there is a potential data race and makes no predictions about the | ||||
| program's outcome. | ||||
| 
 | ||||
| Determining whether two accesses conflict is easy; you can see that | ||||
| all the concepts involved in the definition above are already part of | ||||
| the memory model.  The hard part is telling whether they may execute | ||||
| concurrently.  The LKMM takes a conservative attitude, assuming that | ||||
| accesses may be concurrent unless it can prove they cannot. | ||||
| 
 | ||||
| If two memory accesses aren't concurrent then one must execute before | ||||
| the other.  Therefore the LKMM decides two accesses aren't concurrent | ||||
| if they can be connected by a sequence of hb, pb, and rb links | ||||
| (together referred to as xb, for "executes before").  However, there | ||||
| are two complicating factors. | ||||
| 
 | ||||
| If X is a load and X executes before a store Y, then indeed there is | ||||
| no danger of X and Y being concurrent.  After all, Y can't have any | ||||
| effect on the value obtained by X until the memory subsystem has | ||||
| propagated Y from its own CPU to X's CPU, which won't happen until | ||||
| some time after Y executes and thus after X executes.  But if X is a | ||||
| store, then even if X executes before Y it is still possible that X | ||||
| will propagate to Y's CPU just as Y is executing.  In such a case X | ||||
| could very well interfere somehow with Y, and we would have to | ||||
| consider X and Y to be concurrent. | ||||
| 
 | ||||
| Therefore when X is a store, for X and Y to be non-concurrent the LKMM | ||||
| requires not only that X must execute before Y but also that X must | ||||
| propagate to Y's CPU before Y executes.  (Or vice versa, of course, if | ||||
| Y executes before X -- then Y must propagate to X's CPU before X | ||||
| executes if Y is a store.)  This is expressed by the visibility | ||||
| relation (vis), where X ->vis Y is defined to hold if there is an | ||||
| intermediate event Z such that: | ||||
| 
 | ||||
| 	X is connected to Z by a possibly empty sequence of | ||||
| 	cumul-fence links followed by an optional rfe link (if none of | ||||
| 	these links are present, X and Z are the same event), | ||||
| 
 | ||||
| and either: | ||||
| 
 | ||||
| 	Z is connected to Y by a strong-fence link followed by a | ||||
| 	possibly empty sequence of xb links, | ||||
| 
 | ||||
| or: | ||||
| 
 | ||||
| 	Z is on the same CPU as Y and is connected to Y by a possibly | ||||
| 	empty sequence of xb links (again, if the sequence is empty it | ||||
| 	means Z and Y are the same event). | ||||
| 
 | ||||
| The motivations behind this definition are straightforward: | ||||
| 
 | ||||
| 	cumul-fence memory barriers force stores that are po-before | ||||
| 	the barrier to propagate to other CPUs before stores that are | ||||
| 	po-after the barrier. | ||||
| 
 | ||||
| 	An rfe link from an event W to an event R says that R reads | ||||
| 	from W, which certainly means that W must have propagated to | ||||
| 	R's CPU before R executed. | ||||
| 
 | ||||
| 	strong-fence memory barriers force stores that are po-before | ||||
| 	the barrier, or that propagate to the barrier's CPU before the | ||||
| 	barrier executes, to propagate to all CPUs before any events | ||||
| 	po-after the barrier can execute. | ||||
| 
 | ||||
| To see how this works out in practice, consider our old friend, the MP | ||||
| pattern (with fences and statement labels, but without the conditional | ||||
| test): | ||||
| 
 | ||||
| 	int buf = 0, flag = 0; | ||||
| 
 | ||||
| 	P0() | ||||
| 	{ | ||||
| 		X: WRITE_ONCE(buf, 1); | ||||
| 		   smp_wmb(); | ||||
| 		W: WRITE_ONCE(flag, 1); | ||||
| 	} | ||||
| 
 | ||||
| 	P1() | ||||
| 	{ | ||||
| 		int r1; | ||||
| 		int r2 = 0; | ||||
| 
 | ||||
| 		Z: r1 = READ_ONCE(flag); | ||||
| 		   smp_rmb(); | ||||
| 		Y: r2 = READ_ONCE(buf); | ||||
| 	} | ||||
| 
 | ||||
| The smp_wmb() memory barrier gives a cumul-fence link from X to W, and | ||||
| assuming r1 = 1 at the end, there is an rfe link from W to Z.  This | ||||
| means that the store to buf must propagate from P0 to P1 before Z | ||||
| executes.  Next, Z and Y are on the same CPU and the smp_rmb() fence | ||||
| provides an xb link from Z to Y (i.e., it forces Z to execute before | ||||
| Y).  Therefore we have X ->vis Y: X must propagate to Y's CPU before Y | ||||
| executes. | ||||
| 
 | ||||
| The second complicating factor mentioned above arises from the fact | ||||
| that when we are considering data races, some of the memory accesses | ||||
| are plain.  Now, although we have not said so explicitly, up to this | ||||
| point most of the relations defined by the LKMM (ppo, hb, prop, | ||||
| cumul-fence, pb, and so on -- including vis) apply only to marked | ||||
| accesses. | ||||
| 
 | ||||
| There are good reasons for this restriction.  The compiler is not | ||||
| allowed to apply fancy transformations to marked accesses, and | ||||
| consequently each such access in the source code corresponds more or | ||||
| less directly to a single machine instruction in the object code.  But | ||||
| plain accesses are a different story; the compiler may combine them, | ||||
| split them up, duplicate them, eliminate them, invent new ones, and | ||||
| who knows what else.  Seeing a plain access in the source code tells | ||||
| you almost nothing about what machine instructions will end up in the | ||||
| object code. | ||||
| 
 | ||||
| Fortunately, the compiler isn't completely free; it is subject to some | ||||
| limitations.  For one, it is not allowed to introduce a data race into | ||||
| the object code if the source code does not already contain a data | ||||
| race (if it could, memory models would be useless and no multithreaded | ||||
| code would be safe!).  For another, it cannot move a plain access past | ||||
| a compiler barrier. | ||||
| 
 | ||||
| A compiler barrier is a kind of fence, but as the name implies, it | ||||
| only affects the compiler; it does not necessarily have any effect on | ||||
| how instructions are executed by the CPU.  In Linux kernel source | ||||
| code, the barrier() function is a compiler barrier.  It doesn't give | ||||
| rise directly to any machine instructions in the object code; rather, | ||||
| it affects how the compiler generates the rest of the object code. | ||||
| Given source code like this: | ||||
| 
 | ||||
| 	... some memory accesses ... | ||||
| 	barrier(); | ||||
| 	... some other memory accesses ... | ||||
| 
 | ||||
| the barrier() function ensures that the machine instructions | ||||
| corresponding to the first group of accesses will all end po-before | ||||
| any machine instructions corresponding to the second group of accesses | ||||
| -- even if some of the accesses are plain.  (Of course, the CPU may | ||||
| then execute some of those accesses out of program order, but we | ||||
| already know how to deal with such issues.)  Without the barrier() | ||||
| there would be no such guarantee; the two groups of accesses could be | ||||
| intermingled or even reversed in the object code. | ||||
| 
 | ||||
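For example, in a sketch like the following (data and data_ready are
hypothetical shared variables, not a real kernel API), the compiler
must emit the machine instructions for the plain store to data
po-before those for the WRITE_ONCE():

	data = 42;			/* plain store */
	barrier();			/* compiler may not reorder across this */
	WRITE_ONCE(data_ready, 1);	/* marked store */

Note that barrier() constrains only the compiler; the CPU might still
execute or propagate the two stores out of order, which is why real
producer code would use smp_wmb() or smp_store_release() here instead.
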
| The LKMM doesn't say much about the barrier() function, but it does | ||||
| require that all fences are also compiler barriers.  In addition, it | ||||
| requires that the ordering properties of memory barriers such as | ||||
| smp_rmb() or smp_store_release() apply to plain accesses as well as to | ||||
| marked accesses. | ||||
| 
 | ||||
| This is the key to analyzing data races.  Consider the MP pattern | ||||
| again, now using plain accesses for buf: | ||||
| 
 | ||||
| 	int buf = 0, flag = 0; | ||||
| 
 | ||||
| 	P0() | ||||
| 	{ | ||||
| 		U: buf = 1; | ||||
| 		   smp_wmb(); | ||||
| 		X: WRITE_ONCE(flag, 1); | ||||
| 	} | ||||
| 
 | ||||
| 	P1() | ||||
| 	{ | ||||
| 		int r1; | ||||
| 		int r2 = 0; | ||||
| 
 | ||||
| 		Y: r1 = READ_ONCE(flag); | ||||
| 		   if (r1) { | ||||
| 			   smp_rmb(); | ||||
| 			V: r2 = buf; | ||||
| 		   } | ||||
| 	} | ||||
| 
 | ||||
| This program does not contain a data race.  Although the U and V | ||||
| accesses conflict, the LKMM can prove they are not concurrent as | ||||
| follows: | ||||
| 
 | ||||
| 	The smp_wmb() fence in P0 is both a compiler barrier and a | ||||
| 	cumul-fence.  It guarantees that no matter what hash of | ||||
| 	machine instructions the compiler generates for the plain | ||||
| 	access U, all those instructions will be po-before the fence. | ||||
| 	Consequently U's store to buf, no matter how it is carried out | ||||
| 	at the machine level, must propagate to P1 before X's store to | ||||
| 	flag does. | ||||
| 
 | ||||
| 	X and Y are both marked accesses.  Hence an rfe link from X to | ||||
| 	Y is a valid indicator that X propagated to P1 before Y | ||||
| 	executed, i.e., X ->vis Y.  (And if there is no rfe link then | ||||
| 	r1 will be 0, so V will not be executed and ipso facto won't | ||||
| 	race with U.) | ||||
| 
 | ||||
| 	The smp_rmb() fence in P1 is a compiler barrier as well as a | ||||
| 	fence.  It guarantees that all the machine-level instructions | ||||
| 	corresponding to the access V will be po-after the fence, and | ||||
| 	therefore any loads among those instructions will execute | ||||
| 	after the fence does and hence after Y does. | ||||
| 
 | ||||
| Thus U's store to buf is forced to propagate to P1 before V's load | ||||
| executes (assuming V does execute), ruling out the possibility of a | ||||
| data race between them. | ||||
| 
 | ||||
| This analysis illustrates how the LKMM deals with plain accesses in | ||||
| general.  Suppose R is a plain load and we want to show that R | ||||
| executes before some marked access E.  We can do this by finding a | ||||
| marked access X such that R and X are ordered by a suitable fence and | ||||
| X ->xb* E.  If E was also a plain access, we would also look for a | ||||
| marked access Y such that X ->xb* Y, and Y and E are ordered by a | ||||
| fence.  We describe this arrangement by saying that R is | ||||
| "post-bounded" by X and E is "pre-bounded" by Y. | ||||
| 
 | ||||
| In fact, we go one step further: Since R is a read, we say that R is | ||||
| "r-post-bounded" by X.  Similarly, E would be "r-pre-bounded" or | ||||
| "w-pre-bounded" by Y, depending on whether E was a store or a load. | ||||
| This distinction is needed because some fences affect only loads | ||||
| (i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise | ||||
| the two types of bounds are the same.  And as a degenerate case, we | ||||
| say that a marked access pre-bounds and post-bounds itself (e.g., if R | ||||
| above were a marked load then X could simply be taken to be R itself). | ||||
| 
 | ||||
| The need to distinguish between r- and w-bounding raises yet another | ||||
| issue.  When the source code contains a plain store, the compiler is | ||||
| allowed to put plain loads of the same location into the object code. | ||||
| For example, given the source code: | ||||
| 
 | ||||
| 	x = 1; | ||||
| 
 | ||||
| the compiler is theoretically allowed to generate object code that | ||||
| looks like: | ||||
| 
 | ||||
| 	if (x != 1) | ||||
| 		x = 1; | ||||
| 
 | ||||
| thereby adding a load (and possibly replacing the store entirely). | ||||
| For this reason, whenever the LKMM requires a plain store to be | ||||
| w-pre-bounded or w-post-bounded by a marked access, it also requires | ||||
| the store to be r-pre-bounded or r-post-bounded, so as to handle cases | ||||
| where the compiler adds a load. | ||||
| 
 | ||||
| (This may be overly cautious.  We don't know of any examples where a | ||||
| compiler has augmented a store with a load in this fashion, and the | ||||
| Linux kernel developers would probably fight pretty hard to change a | ||||
| compiler if it ever did this.  Still, better safe than sorry.) | ||||
| 
 | ||||
| Incidentally, the other transformation -- augmenting a plain load by | ||||
| adding in a store to the same location -- is not allowed.  This is | ||||
| because the compiler cannot know whether any other CPUs might perform | ||||
| a concurrent load from that location.  Two concurrent loads don't | ||||
| constitute a race (they can't interfere with each other), but a store | ||||
| does race with a concurrent load.  Thus adding a store might create a | ||||
| data race where one was not already present in the source code, | ||||
| something the compiler is forbidden to do.  Augmenting a store with a | ||||
| load, on the other hand, is acceptable because doing so won't create a | ||||
| data race unless one already existed. | ||||
| 
 | ||||
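As a sketch of why this direction is dangerous (a hypothetical
transformation, not something a conforming compiler may emit), turning
the plain load

	r = x;

into

	x = x;		/* invented store */
	r = x;

would make the access race with a concurrent load of x on another CPU,
even though the original source code contained no store at all.
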
| The LKMM includes a second way to pre-bound plain accesses, in | ||||
| addition to fences: an address dependency from a marked load.  That | ||||
| is, in the sequence: | ||||
| 
 | ||||
| 	p = READ_ONCE(ptr); | ||||
| 	r = *p; | ||||
| 
 | ||||
| the LKMM says that the marked load of ptr pre-bounds the plain load of | ||||
| *p; the marked load must execute before any of the machine | ||||
| instructions corresponding to the plain load.  This is a reasonable | ||||
| stipulation, since after all, the CPU can't perform the load of *p | ||||
| until it knows what value p will hold.  Furthermore, without some | ||||
| assumption like this one, some usages typical of RCU would count as | ||||
| data races.  For example: | ||||
| 
 | ||||
| 	int a = 1, b; | ||||
| 	int *ptr = &a; | ||||
| 
 | ||||
| 	P0() | ||||
| 	{ | ||||
| 		b = 2; | ||||
| 		rcu_assign_pointer(ptr, &b); | ||||
| 	} | ||||
| 
 | ||||
| 	P1() | ||||
| 	{ | ||||
| 		int *p; | ||||
| 		int r; | ||||
| 
 | ||||
| 		rcu_read_lock(); | ||||
| 		p = rcu_dereference(ptr); | ||||
| 		r = *p; | ||||
| 		rcu_read_unlock(); | ||||
| 	} | ||||
| 
 | ||||
| (In this example the rcu_read_lock() and rcu_read_unlock() calls don't | ||||
| really do anything, because there aren't any grace periods.  They are | ||||
| included merely for the sake of good form; typically P0 would call | ||||
| synchronize_rcu() somewhere after the rcu_assign_pointer().) | ||||
| 
 | ||||
| rcu_assign_pointer() performs a store-release, so the plain store to b | ||||
| is definitely w-post-bounded before the store to ptr, and the two | ||||
| stores will propagate to P1 in that order.  However, rcu_dereference() | ||||
| is only equivalent to READ_ONCE().  While it is a marked access, it is | ||||
| not a fence or compiler barrier.  Hence the only guarantee we have | ||||
| that the plain load of *p in P1 is r-pre-bounded by the load of ptr | ||||
| (thus avoiding a race) is the assumption about address dependencies. | ||||
| 
 | ||||
| This is a situation where the compiler can undermine the memory model, | ||||
| and a certain amount of care is required when programming constructs | ||||
| like this one.  In particular, comparisons between the pointer and | ||||
| other known addresses can cause trouble.  If you have something like: | ||||
| 
 | ||||
| 	p = rcu_dereference(ptr); | ||||
| 	if (p == &x) | ||||
| 		r = *p; | ||||
| 
 | ||||
| then the compiler just might generate object code resembling: | ||||
| 
 | ||||
| 	p = rcu_dereference(ptr); | ||||
| 	if (p == &x) | ||||
| 		r = x; | ||||
| 
 | ||||
| or even: | ||||
| 
 | ||||
| 	rtemp = x; | ||||
| 	p = rcu_dereference(ptr); | ||||
| 	if (p == &x) | ||||
| 		r = rtemp; | ||||
| 
 | ||||
| which would invalidate the memory model's assumption, since the CPU | ||||
| could now perform the load of x before the load of ptr (there might be | ||||
| a control dependency but no address dependency at the machine level). | ||||
| 
 | ||||
| Finally, it turns out there is a situation in which a plain write does | ||||
| not need to be w-post-bounded: when it is separated from the | ||||
| conflicting access by a fence.  At first glance this may seem | ||||
| impossible.  After all, to be conflicting the second access has to be | ||||
| on a different CPU from the first, and fences don't link events on | ||||
| different CPUs.  Well, normal fences don't -- but rcu-fence can! | ||||
| Here's an example: | ||||
| 
 | ||||
| 	int x, y; | ||||
| 
 | ||||
| 	P0() | ||||
| 	{ | ||||
| 		WRITE_ONCE(x, 1); | ||||
| 		synchronize_rcu(); | ||||
| 		y = 3; | ||||
| 	} | ||||
| 
 | ||||
| 	P1() | ||||
| 	{ | ||||
| 		rcu_read_lock(); | ||||
| 		if (READ_ONCE(x) == 0) | ||||
| 			y = 2; | ||||
| 		rcu_read_unlock(); | ||||
| 	} | ||||
| 
 | ||||
| Do the plain stores to y race?  Clearly not if P1 reads a non-zero | ||||
| value for x, so let's assume the READ_ONCE(x) does obtain 0.  This | ||||
| means that the read-side critical section in P1 must finish executing | ||||
| before the grace period in P0 does, because RCU's Grace-Period | ||||
| Guarantee says that otherwise P0's store to x would have propagated to | ||||
| P1 before the critical section started and so would have been visible | ||||
| to the READ_ONCE().  (Another way of putting it is that the fre link | ||||
| from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link | ||||
| between those two events.) | ||||
| 
 | ||||
| This means there is an rcu-fence link from P1's "y = 2" store to P0's | ||||
| "y = 3" store, and consequently the first must propagate from P1 to P0 | ||||
| before the second can execute.  Therefore the two stores cannot be | ||||
| concurrent and there is no race, even though P1's plain store to y | ||||
| isn't w-post-bounded by any marked accesses. | ||||
| 
 | ||||
| Putting all this material together yields the following picture.  For | ||||
| two conflicting stores W and W', where W ->co W', the LKMM says the | ||||
| stores don't race if W can be linked to W' by a | ||||
| 
 | ||||
| 	w-post-bounded ; vis ; w-pre-bounded | ||||
| 
 | ||||
| sequence.  If W is plain then they also have to be linked by an | ||||
| 
 | ||||
| 	r-post-bounded ; xb* ; w-pre-bounded | ||||
| 
 | ||||
| sequence, and if W' is plain then they also have to be linked by a | ||||
| 
 | ||||
| 	w-post-bounded ; vis ; r-pre-bounded | ||||
| 
 | ||||
| sequence.  For a conflicting load R and store W, the LKMM says the two | ||||
| accesses don't race if R can be linked to W by an | ||||
| 
 | ||||
| 	r-post-bounded ; xb* ; w-pre-bounded | ||||
| 
 | ||||
| sequence or if W can be linked to R by a | ||||
| 
 | ||||
| 	w-post-bounded ; vis ; r-pre-bounded | ||||
| 
 | ||||
| sequence.  For the cases involving a vis link, the LKMM also accepts | ||||
| sequences in which W is linked to W' or R by a | ||||
| 
 | ||||
| 	strong-fence ; xb* ; {w and/or r}-pre-bounded | ||||
| 
 | ||||
| sequence with no post-bounding, and in every case the LKMM also allows | ||||
| the link simply to be a fence with no bounding at all.  If no sequence | ||||
| of the appropriate sort exists, the LKMM says that the accesses race. | ||||
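| 
| As a concrete illustration of the first of these patterns (a sketch, | ||||
| not an example taken from this document), consider two plain stores to | ||||
| x that are kept from racing by a release/acquire pair: | ||||
| 
| 	int x, y; | ||||
| 
| 	P0() | ||||
| 	{ | ||||
| 		x = 1;				/* plain store W */ | ||||
| 		smp_store_release(&y, 1);	/* w-post-bounds W */ | ||||
| 	} | ||||
| 
| 	P1() | ||||
| 	{ | ||||
| 		if (smp_load_acquire(&y) == 1)	/* supplies the vis step */ | ||||
| 			x = 2;			/* w-pre-bounded plain store W' */ | ||||
| 	} | ||||
| 
| Whenever P1's store executes at all, the acquire load has read from the | ||||
| release store, so W is linked to W' by a w-post-bounded ; vis ; | ||||
| w-pre-bounded sequence (the extra sequences required because both | ||||
| stores are plain also hold here), and the LKMM does not flag a race. | ||||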
| 
 | ||||
| There is one more part of the LKMM related to plain accesses (although | ||||
| not to data races) that we should discuss.  Recall that many relations | ||||
| such as hb are limited to marked accesses only.  As a result, the | ||||
| happens-before, propagates-before, and rcu axioms (which state that | ||||
| various relations must not contain cycles) do not apply to plain | ||||
| accesses.  Nevertheless, we do want to rule out such cycles, because | ||||
| they don't make sense even for plain accesses. | ||||
| 
 | ||||
| To this end, the LKMM imposes three extra restrictions, together | ||||
| called the "plain-coherence" axiom because of their resemblance to the | ||||
| rules used by the operational model to ensure cache coherence (that | ||||
| is, the rules governing the memory subsystem's choice of a store to | ||||
| satisfy a load request and its determination of where a store will | ||||
| fall in the coherence order): | ||||
| 
 | ||||
| 	If R and W conflict and it is possible to link R to W by one | ||||
| 	of the xb* sequences listed above, then W ->rfe R is not | ||||
| 	allowed (i.e., a load cannot read from a store that it | ||||
| 	executes before, even if one or both is plain). | ||||
| 
 | ||||
| 	If W and R conflict and it is possible to link W to R by one | ||||
| 	of the vis sequences listed above, then R ->fre W is not | ||||
| 	allowed (i.e., if a store is visible to a load then the load | ||||
| 	must read from that store or one coherence-after it). | ||||
| 
 | ||||
| 	If W and W' conflict and it is possible to link W to W' by one | ||||
| 	of the vis sequences listed above, then W' ->co W is not | ||||
| 	allowed (i.e., if one store is visible to a second then the | ||||
| 	second must come after the first in the coherence order). | ||||
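| 
| As a rough illustration of the second of these restrictions (a sketch, | ||||
| not an example taken from this document), consider a message-passing | ||||
| pattern in which the data is accessed only by plain loads and stores: | ||||
| 
| 	int data; | ||||
| 	int flag; | ||||
| 
| 	P0() | ||||
| 	{ | ||||
| 		data = 1;			/* plain store */ | ||||
| 		smp_store_release(&flag, 1); | ||||
| 	} | ||||
| 
| 	P1() | ||||
| 	{ | ||||
| 		int r = 0; | ||||
| 
| 		if (smp_load_acquire(&flag) == 1) | ||||
| 			r = data;		/* plain load */ | ||||
| 	} | ||||
| 
| If the acquire load reads 1 then P0's store to data is linked to P1's | ||||
| plain load by one of the vis sequences above, so the load is forbidden | ||||
| from reading the stale initial value: r must end up equal to 1. | ||||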
| 
 | ||||
| This is the extent to which the LKMM deals with plain accesses. | ||||
| Perhaps it could say more (for example, plain accesses might | ||||
| contribute to the ppo relation), but at the moment it seems that this | ||||
| minimal, conservative approach is good enough. | ||||
| 
 | ||||
| 
 | ||||
| ODDS AND ENDS | ||||
| ------------- | ||||
| 
 | ||||
|  | @ -1943,6 +2481,16 @@ treated as READ_ONCE() and rcu_assign_pointer() is treated as | |||
| smp_store_release() -- which is basically how the Linux kernel treats | ||||
| them. | ||||
| 
 | ||||
| Although we said that plain accesses are not linked by the ppo | ||||
| relation, they do contribute to it indirectly.  Namely, when there is | ||||
| an address dependency from a marked load R to a plain store W, | ||||
| followed by smp_wmb() and then a marked store W', the LKMM creates a | ||||
| ppo link from R to W'.  The reasoning behind this is perhaps a little | ||||
| shaky, but essentially it says there is no way to generate object code | ||||
| for this source code in which W' could execute before R.  Just as with | ||||
| pre-bounding by address dependencies, it is possible for the compiler | ||||
| to undermine this relation if sufficient care is not taken. | ||||
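| 
| For example (a hypothetical sketch, with the variables invented purely | ||||
| for illustration), the LKMM creates such a ppo link from the | ||||
| READ_ONCE() to the WRITE_ONCE() in: | ||||
| 
| 	int i; | ||||
| 
| 	i = READ_ONCE(idx);	/* marked load R */ | ||||
| 	a[i] = 1;		/* plain store W, address-dependent on R */ | ||||
| 	smp_wmb(); | ||||
| 	WRITE_ONCE(ready, 1);	/* marked store W'; R ->ppo W' */ | ||||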
| 
 | ||||
| There are a few oddball fences which need special treatment: | ||||
| smp_mb__before_atomic(), smp_mb__after_atomic(), and | ||||
| smp_mb__after_spinlock().  The LKMM uses fence events with special | ||||
|  |  | |||
|  | @ -197,7 +197,7 @@ empty (wr-incoh | rw-incoh | ww-incoh) as plain-coherence | |||
| (* Actual races *) | ||||
| let ww-nonrace = ww-vis & ((Marked * W) | rw-xbstar) & ((W * Marked) | wr-vis) | ||||
| let ww-race = (pre-race & co) \ ww-nonrace | ||||
| let wr-race = (pre-race & (co? ; rf)) \ wr-vis | ||||
| let wr-race = (pre-race & (co? ; rf)) \ wr-vis \ rw-xbstar^-1 | ||||
| let rw-race = (pre-race & fr) \ rw-xbstar | ||||
| 
 | ||||
| flag ~empty (ww-race | wr-race | rw-race) as data-race | ||||
|  |  | |||
|  | @ -1,8 +1,5 @@ | |||
| CONFIG_SMP=y | ||||
| CONFIG_NR_CPUS=2 | ||||
| CONFIG_HOTPLUG_CPU=n | ||||
| CONFIG_SUSPEND=n | ||||
| CONFIG_HIBERNATION=n | ||||
| CONFIG_PREEMPT_NONE=n | ||||
| CONFIG_PREEMPT_VOLUNTARY=n | ||||
| CONFIG_PREEMPT=y | ||||
|  |  | |||
|  | @ -9,9 +9,6 @@ CONFIG_NO_HZ_IDLE=y | |||
| CONFIG_NO_HZ_FULL=n | ||||
| CONFIG_RCU_FAST_NO_HZ=n | ||||
| CONFIG_RCU_TRACE=n | ||||
| CONFIG_HOTPLUG_CPU=n | ||||
| CONFIG_SUSPEND=n | ||||
| CONFIG_HIBERNATION=n | ||||
| CONFIG_RCU_FANOUT=3 | ||||
| CONFIG_RCU_FANOUT_LEAF=3 | ||||
| CONFIG_RCU_NOCB_CPU=n | ||||
|  |  | |||
|  | @ -9,9 +9,6 @@ CONFIG_NO_HZ_IDLE=n | |||
| CONFIG_NO_HZ_FULL=y | ||||
| CONFIG_RCU_FAST_NO_HZ=y | ||||
| CONFIG_RCU_TRACE=y | ||||
| CONFIG_HOTPLUG_CPU=n | ||||
| CONFIG_SUSPEND=n | ||||
| CONFIG_HIBERNATION=n | ||||
| CONFIG_RCU_FANOUT=4 | ||||
| CONFIG_RCU_FANOUT_LEAF=3 | ||||
| CONFIG_DEBUG_LOCK_ALLOC=n | ||||
|  |  | |||
|  | @ -9,9 +9,6 @@ CONFIG_NO_HZ_IDLE=y | |||
| CONFIG_NO_HZ_FULL=n | ||||
| CONFIG_RCU_FAST_NO_HZ=n | ||||
| CONFIG_RCU_TRACE=n | ||||
| CONFIG_HOTPLUG_CPU=n | ||||
| CONFIG_SUSPEND=n | ||||
| CONFIG_HIBERNATION=n | ||||
| CONFIG_RCU_FANOUT=6 | ||||
| CONFIG_RCU_FANOUT_LEAF=6 | ||||
| CONFIG_RCU_NOCB_CPU=n | ||||
|  |  | |||
|  | @ -9,9 +9,6 @@ CONFIG_NO_HZ_IDLE=y | |||
| CONFIG_NO_HZ_FULL=n | ||||
| CONFIG_RCU_FAST_NO_HZ=n | ||||
| CONFIG_RCU_TRACE=n | ||||
| CONFIG_HOTPLUG_CPU=n | ||||
| CONFIG_SUSPEND=n | ||||
| CONFIG_HIBERNATION=n | ||||
| CONFIG_RCU_FANOUT=3 | ||||
| CONFIG_RCU_FANOUT_LEAF=2 | ||||
| CONFIG_RCU_NOCB_CPU=y | ||||
|  |  | |||
|  | @ -8,9 +8,6 @@ CONFIG_HZ_PERIODIC=n | |||
| CONFIG_NO_HZ_IDLE=y | ||||
| CONFIG_NO_HZ_FULL=n | ||||
| CONFIG_RCU_TRACE=n | ||||
| CONFIG_HOTPLUG_CPU=n | ||||
| CONFIG_SUSPEND=n | ||||
| CONFIG_HIBERNATION=n | ||||
| CONFIG_RCU_NOCB_CPU=n | ||||
| CONFIG_DEBUG_LOCK_ALLOC=n | ||||
| CONFIG_RCU_BOOST=n | ||||
|  |  | |||
|  | @ -6,9 +6,6 @@ CONFIG_PREEMPT=n | |||
| CONFIG_HZ_PERIODIC=n | ||||
| CONFIG_NO_HZ_IDLE=y | ||||
| CONFIG_NO_HZ_FULL=n | ||||
| CONFIG_HOTPLUG_CPU=n | ||||
| CONFIG_SUSPEND=n | ||||
| CONFIG_HIBERNATION=n | ||||
| CONFIG_DEBUG_LOCK_ALLOC=n | ||||
| CONFIG_DEBUG_OBJECTS_RCU_HEAD=n | ||||
| CONFIG_RCU_EXPERT=y | ||||
|  |  | |||
|  | @ -6,7 +6,6 @@ Kconfig Parameters: | |||
| 
 | ||||
| CONFIG_DEBUG_LOCK_ALLOC -- Do three, covering CONFIG_PROVE_LOCKING & not. | ||||
| CONFIG_DEBUG_OBJECTS_RCU_HEAD -- Do one. | ||||
| CONFIG_HOTPLUG_CPU -- Do half.  (Every second.) | ||||
| CONFIG_HZ_PERIODIC -- Do one. | ||||
| CONFIG_NO_HZ_IDLE -- Do those not otherwise specified. (Groups of two.) | ||||
| CONFIG_NO_HZ_FULL -- Do two, one with partial CPU enablement. | ||||
|  |  | |||