mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-05-24 10:39:52 +00:00
8 commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
d7ad05c86e |
timers/migration: Prevent out of bounds access on failure
When tmigr_setup_groups() fails the level 0 group allocation, then the
cleanup dereferences index -1 of the local stack array.
Prevent this by checking the loop condition first.
Fixes:
|
||
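The out-of-bounds access described above can be illustrated with a small user-space model. This is only a sketch: it assumes the cleanup was a do/while-style loop that pre-decrements its index, and the function names are hypothetical. Each function returns the list of array slots the cleanup would touch when allocation failed at a given level.

```python
def cleanup_indices_check_last(failed_level):
    """Model of a loop whose body runs before the condition is tested."""
    touched = []
    i = failed_level
    while True:
        i -= 1
        touched.append(i)  # the slot stack[i] would be accessed here
        if not i > 0:
            break
    return touched


def cleanup_indices_check_first(failed_level):
    """Model of a loop that tests the condition before running the body."""
    touched = []
    i = failed_level
    while i > 0:
        i -= 1
        touched.append(i)
    return touched
```

With a failure at level 0, the first variant touches slot -1 while the second touches nothing. For any higher level both variants visit the same slots.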
![]() |
7a96a84bfb |
timers/migration: Return early on deactivation
Commit |
||
![]() |
61f7fdf8fd |
timers/migration: Fix ignored event due to missing CPU update
When a group event is updated with its expiry unchanged but a different
CPU, that target change may go unnoticed and the event may be propagated
up with a stale CPU value. The following depicts a scenario that has
been actually observed:
                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = TGRP1:0 (T0)
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T0
      /         \
     0 (T0)      1 (T1)
     idle        idle
0) The hierarchy has 3 levels. The left part (GRP1:0) is all idle,
including CPU 0 and CPU 1 which have a timer each: T0 and T1. They have
the same expiry value.
                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T0
      /         \
     0 (T0)      1 (T1)
     idle        idle
1) The migrator in GRP1:1 remotely handles T0. The event is dequeued
from the top and T0 is executed.
                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
     0           1 (T1)
     idle        idle
2) The migrator in GRP1:1 fetches the next timer for CPU 0 and finds
none. But it updates the events from its groups, starting with GRP0:0
which now has T1 as its next event. So far so good.
                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
     0           1 (T1)
     idle        idle
3) The migrator in GRP1:1 proceeds upward and updates the events in
GRP1:0. The child event TGRP0:0 is found queued with the same expiry
as before. And therefore it is left unchanged. However the target CPU
is not the same but that fact is ignored so TGRP0:0 still points to
CPU 0 when it should point to CPU 1.
                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = TGRP1:0 (T0)
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
     0           1 (T1)
     idle        idle
4) The propagation has reached the top level and TGRP1:0, having TGRP0:0
as its first event, also wrongly points to CPU 0. TGRP1:0 is added to
the top level group.
                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
     0           1 (T1)
     idle        idle
5) The migrator in GRP1:1 dequeues the next event in the top level, which
points to CPU 0. But since it doesn't actually see any real event in
CPU 0, it returns early.
6) T1 is left unhandled until either CPU 0 or CPU 1 wakes up.
Other bad scenarios may involve trees with just two levels.
Fix this by unconditionally updating the CPU of the child event before
considering an early return while updating a queued event with an
unchanged expiry value.
Fixes:
|
||
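The ordering problem in the entry above (early return on an unchanged expiry before the target CPU is refreshed) can be sketched with a toy model. The `GroupEvent` class, the `update_queued_event` function and the `fixed` flag are hypothetical names for illustration, not the kernel's data structures.

```python
from dataclasses import dataclass


@dataclass
class GroupEvent:
    expiry: int
    cpu: int
    queued: bool = True


def update_queued_event(evt, new_expiry, new_cpu, fixed=True):
    """Return True if the event was requeued, False on early return."""
    if fixed:
        # Refresh the target CPU before deciding whether to return early.
        evt.cpu = new_cpu
    if evt.queued and evt.expiry == new_expiry:
        return False  # unchanged expiry: nothing to requeue
    evt.expiry = new_expiry
    evt.cpu = new_cpu
    evt.queued = True
    return True
```

With `fixed=False`, an update that keeps the expiry but changes the CPU leaves the stale CPU in place, which corresponds to step 3 of the scenario. With `fixed=True`, the early return still happens but the event points at the new CPU.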
![]() |
f55acb1e44 |
timers/migration: Fix endless timer requeue after idle interrupts
When a CPU is an idle migrator, but another CPU wakes up before it,
becomes an active migrator and handles the queue, the initial idle
migrator may end up endlessly reprogramming its clockevent, chasing ghost
timers forever such as in the following scenario:
                       [GRP0:0]
                    migrator = 0
                    active   = 0
                    nextevt  = T1
                    /         \
                   0           1
                active      idle (T1)
0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is
the active migrator.
                       [GRP0:0]
                    migrator = NONE
                    active   = NONE
                    nextevt  = T1
                    /         \
                   0           1
                idle        idle (T1)
                wakeup = T1
1) CPU 0 is now idle and is therefore the idle migrator. It has
programmed its next timer interrupt to handle T1.
                       [GRP0:0]
                    migrator = 1
                    active   = 1
                    nextevt  = KTIME_MAX
                    /         \
                   0           1
                idle        active
                wakeup = T1
2) CPU 1 has woken up, it is now active and it has just handled its own
timer T1.
3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote()
realizes it is not the migrator anymore. So it returns early without
observing that T1 has been expired already and therefore without
updating its ->wakeup value.
4) CPU 0 goes into tmigr_cpu_new_timer() which also early returns
because it doesn't queue a timer of its own. So ->wakeup is left
unchanged and the next timer is programmed to fire now.
5) goto 3) forever
This results in timer interrupt storms in idle and also in nohz_full (as
observed in rcutorture's TREE07 scenario).
Fix this by forcing a re-evaluation of tmc->wakeup while trying
remote timer handling when the CPU isn't the migrator anymore. The
check is inherently racy but in the worst case the CPU just races setting
the KTIME_MAX value that a remote expiry also tries to set.
Fixes:
|
||
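The interrupt storm in the entry above can be reproduced in a toy model. This sketch assumes that "re-evaluating" the wakeup value on the non-migrator path amounts to resetting it to KTIME_MAX, which the commit message hints at but does not spell out. `CpuState`, `handle_remote` and `interrupts_until_quiet` are hypothetical names.

```python
KTIME_MAX = 2**63 - 1


class CpuState:
    def __init__(self, wakeup=KTIME_MAX):
        self.wakeup = wakeup


def handle_remote(cpu, is_migrator, fixed=True):
    """Model only the early-return path of remote timer handling."""
    if not is_migrator:
        if fixed:
            # Assumed fix: drop the stale wakeup so it is not chased again.
            cpu.wakeup = KTIME_MAX
        return
    # The migrator path is out of scope for this sketch.


def interrupts_until_quiet(cpu, now, fixed, limit=5):
    """Count timer interrupts taken while the stale wakeup keeps firing."""
    count = 0
    while cpu.wakeup <= now and count < limit:
        count += 1
        handle_remote(cpu, is_migrator=False, fixed=fixed)
    return count
```

Without the reset, the loop only stops because of the artificial `limit`. With it, a single interrupt is enough to quiet the CPU.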
![]() |
4b6f4c5a67 |
timer/migration: Remove buggy early return on deactivation
When a CPU enters into idle and deactivates itself from the timer
migration hierarchy without any global timer of its own to propagate,
the group event of that CPU is set to "ignore" and tmigr_update_events()
accordingly performs an early return without considering timers queued
by other CPUs.
If the hierarchy has a single level, and the CPU is the last one to
enter idle, it will ignore others' global timers, as in the following
layout:
           [GRP0:0]
         migrator = 0
         active   = 0
         nextevt  = T0i
          /         \
         0           1
      active (T0i)  idle (T1)
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
           [GRP0:0]
         migrator = NONE
         active   = NONE
         nextevt  = T0i
          /         \
         0           1
      idle (T0i)    idle (T1)
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
unhandled.
This isn't specific to single-level hierarchies, though. A similar issue,
although slightly different, may arise on multi-level:
                            [GRP1:0]
                        migrator = GRP0:0
                        active   = GRP0:0
                        nextevt  = T0:0i, T0:1
                        /              \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = NONE
           active   = 0              active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
          active      idle           idle        idle
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
                            [GRP1:0]
                        migrator = GRP0:0
                        active   = GRP0:0
                        nextevt  = T0:0i, T0:1
                        /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
          idle        idle           idle        idle
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1. The change only propagated up to 1st
level so far.
                            [GRP1:0]
                        migrator = NONE
                        active   = NONE
                        nextevt  = T0:1
                        /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
          idle        idle           idle        idle
2) The change now propagates up to the top. tmigr_update_events() finds
that the child event is ignored and thus removes it. The top level next
event is now T2 which is returned to CPU 0 as its next effective expiry
to take into account as the global idle migrator. However T1 has been
ignored along the way, leaving it unhandled.
Fix those issues by removing the buggy early return. Ignored child
events must not prevent the evaluation of the other events within the
same group.
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/ZfOhB9ZByTZcBy4u@lothringen |
||
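The effect of the early return described in the entry above can be shown with a toy group model. This is a sketch under simplifying assumptions: a group is a plain mapping from child name to an `(expiry, ignored)` pair, and `group_next_expiry` is a hypothetical helper, not a kernel function.

```python
KTIME_MAX = 2**63 - 1


def group_next_expiry(events, updated_child, early_return_on_ignore):
    """events maps a child name to an (expiry, ignored) pair."""
    _, ignored = events[updated_child]
    if early_return_on_ignore and ignored:
        # Buggy behaviour: the siblings are never looked at.
        return KTIME_MAX
    live = [expiry for expiry, ign in events.values() if not ign]
    return min(live, default=KTIME_MAX)
```

In the single-level layout, CPU 0's own event is ignored while CPU 1 still holds T1. Only the variant without the early return reports T1 as the group's next expiry.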
![]() |
8ca1836769 |
timer/migration: Fix quick check reporting late expiry
When a CPU is the last active in the hierarchy and it tries to enter
into idle, the quick check looking up the next event towards cpuidle
heuristics may report a too late expiry, such as in the following
scenario:
                       [GRP1:0]
                    migrator = NONE
                    active   = NONE
                    nextevt  = T0:0, T0:1
                    /              \
          [GRP0:0]                  [GRP0:1]
       migrator = NONE           migrator = NONE
       active   = NONE           active   = NONE
       nextevt  = T0, T1         nextevt  = T2
       /         \                /         \
      0           1              2           3
    idle         idle           idle        idle
0) The whole system is idle, and CPU 0 was the last migrator. CPU 0 has
a timer (T0), CPU 1 has a timer (T1) and CPU 2 has a timer (T2). The
expire order is T0 < T1 < T2.
                       [GRP1:0]
                    migrator = GRP0:0
                    active   = GRP0:0
                    nextevt  = T0:0(i), T0:1
                    /              \
          [GRP0:0]                  [GRP0:1]
       migrator = CPU0           migrator = NONE
       active   = CPU0           active   = NONE
       nextevt  = T0(i), T1      nextevt  = T2
       /         \                /         \
      0           1              2           3
    active       idle           idle        idle
1) CPU 0 becomes active. The (i) means a now ignored timer.
                       [GRP1:0]
                    migrator = GRP0:0
                    active   = GRP0:0
                    nextevt  = T0:1
                    /              \
          [GRP0:0]                  [GRP0:1]
       migrator = CPU0           migrator = NONE
       active   = CPU0           active   = NONE
       nextevt  = T1             nextevt  = T2
       /         \                /         \
      0           1              2           3
    active       idle           idle        idle
2) CPU 0 handles remote. No timer actually expired but ignored timers
have been cleaned out and their sibling's timers haven't been
propagated. As a result the top level's next event is T2 and not T1.
3) CPU 0 tries to enter idle without any global timer enqueued and calls
tmigr_quick_check(). The expiry of T2 is returned instead of the
expiry of T1.
When the quick check returns an expiry that is too late, the cpuidle
governor may pick a C-state that is too deep. This may result in
undesired CPU wake-up latency if the next timer is actually close enough.
Fix this by assuming that expiries aren't sorted top-down while
performing the quick check. Instead, pick the earliest one encountered
while walking up the hierarchy.
|
||
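The "earliest encountered" walk from the entry above can be sketched as follows. The `Group` class and both functions are hypothetical, and the top-only variant is just one assumed reading of what an over-optimistic check might do. It is not taken from the kernel source.

```python
KTIME_MAX = 2**63 - 1


class Group:
    def __init__(self, next_expiry, parent=None):
        self.next_expiry = next_expiry
        self.parent = parent


def quick_check_top_only(group):
    """Trust the top level to hold the earliest expiry."""
    while group.parent is not None:
        group = group.parent
    return group.next_expiry


def quick_check_earliest(group):
    """Keep the earliest expiry seen on the way up."""
    earliest = KTIME_MAX
    while group is not None:
        earliest = min(earliest, group.next_expiry)
        group = group.parent
    return earliest
```

Using the state after step 2, the bottom group holds T1 and the top level only T2. The walk that tracks the minimum reports T1, while the top-only variant reports the later T2.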
![]() |
36e40df35d |
timer_migration: Add tracepoints
The timer pull logic needs proper debugging aids. Add tracepoints so the
hierarchical idle machinery can be diagnosed.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240222103403.31923-1-anna-maria@linutronix.de |
||
![]() |
7ee9887703 |
timers: Implement the hierarchical pull model
Placing timers at enqueue time on a target CPU based on dubious heuristics
does not make any sense:
 1) Most timer wheel timers are canceled or rearmed before they expire.
 2) The heuristics to predict which CPU will be busy when the timer expires
    are wrong by definition.
So placing the timers at enqueue wastes precious cycles.
The proper solution to this problem is to always queue the timers on the
local CPU and allow the non pinned timers to be pulled onto a busy CPU at
expiry time.
Therefore split the timer storage into local pinned and global timers:
Local pinned timers are always expired on the CPU on which they have been
queued. Global timers can be expired on any CPU.
As long as a CPU is busy it expires both local and global timers. When a
CPU goes idle it arms for the first expiring local timer. If the first
expiring pinned (local) timer is before the first expiring movable timer,
then no action is required because the CPU will wake up before the first
movable timer expires. If the first expiring movable timer is before the
first expiring pinned (local) timer, then this timer is queued into an idle
timerqueue and eventually expired by another active CPU.
To avoid global locking the timerqueues are implemented as a hierarchy. The
lowest level of the hierarchy holds the CPUs. The CPUs are associated to
groups of 8, which are separated per node. If more than one CPU group
exists, then a second level in the hierarchy collects the groups. Depending
on the size of the system more than 2 levels are required. Each group has a
"migrator" which checks the timerqueue during the tick for remote expirable
timers.
If the last CPU in a group goes idle it reports the first expiring event in
the group up to the next group(s) in the hierarchy. If the last CPU goes
idle it arms its timer for the first system wide expiring timer to ensure
that no timer event is missed.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240222103710.32582-1-anna-maria@linutronix.de |
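Two rules from the pull-model description above lend themselves to a small sketch: the decision taken when a CPU goes idle, and how many hierarchy levels a group size of 8 implies. The function names are hypothetical. The level count ignores the per-node separation mentioned in the message, and treating equal expiries as "no action" is an assumption, since the message does not cover ties.

```python
GROUP_SIZE = 8  # "groups of 8" from the commit message


def needs_idle_queue(first_local, first_global):
    """True if the first global timer must go to the idle timerqueue.

    If the pinned (local) timer expires first, the CPU wakes up in time
    anyway. Ties are treated the same way (assumption).
    """
    return first_global < first_local


def hierarchy_levels(num_cpus):
    """Levels needed to funnel num_cpus into one top group (no NUMA split)."""
    levels = 0
    members = num_cpus
    while True:
        members = (members + GROUP_SIZE - 1) // GROUP_SIZE
        levels += 1
        if members <= 1:
            return levels
```

For example, 8 CPUs fit in a single level, 9 to 64 CPUs need two, and 65 CPUs already need a third level in this simplified count.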