Merge branch 'resilient-queued-spin-lock'
Kumar Kartikeya Dwivedi says:

====================
Resilient Queued Spin Lock

Changelog:
----------
v3 -> v4
v4: https://lore.kernel.org/bpf/20250303152305.3195648-1-memxor@gmail.com
 * Fix bisectability problem by reordering the locktorture commit before the Makefile commit.
 * Add EXPORT_SYMBOL_GPL to all symbols and variables used by consumers.
 * Skip BPF selftest when nrprocs < 2.
 * Fix kdoc to describe the return value for res_spin_lock, slowpath.
 * Move kernel/locking/rqspinlock.{c,h} to kernel/bpf/rqspinlock.{c,h}.

v2 -> v3
v2: https://lore.kernel.org/bpf/20250206105435.2159977-1-memxor@gmail.com
 * Add ifdefs to fall back to Ankur's patch when it lands; until then, copy-paste the implementation.
 * Change the meaning of RES_DEF_TIMEOUT from two critical section lengths to one for clarity, and use RES_DEF_TIMEOUT * 2 where needed.
 * Use NSEC_PER_SEC as the timeout for the TAS fallback.
 * Add Closes: tags for known syzbot reports.
 * Change timeout for TAS fallback to 1 second.
 * Fix more kernel test robot errors.
 * More comments about the smp_wmb in release_held_lock_entry interaction.
 * Change RES_NR_HELD to 31.
 * Address comments from Peter, Eduard, Alexei.

v1 -> v2
v1: https://lore.kernel.org/bpf/20250107140004.2732830-1-memxor@gmail.com
 * Address nits from Waiman and Peter.
 * Fix arm64 WFE bug pointed out by Peter.
 * Fix incorrect memory ordering in release_held_lock_entry, and document subtleties. Explain why release is sufficient in unlock but not in release_held_lock_entry.
 * Remove dependence on CONFIG_QUEUED_SPINLOCKS and introduce a test-and-set fallback when queued spinlock support is missing on an architecture.
 * Enforce FIFO ordering for BPF program spin unlocks.
 * Address comments from Eduard on verifier plumbing.
 * Add comments as suggested by Waiman.
 * Refactor paravirt TAS lock to use the implemented TAS fallback.
 * Use rqspinlock_t as the type throughout so that it can be replaced with a non-qspinlock type in case of fallback.
 * Testing and benchmarking on arm64, added numbers to the cover letter.
 * Fix kernel test robot errors.
 * Fix a BPF selftest bug leading to spurious failures on arm64.

Introduction
------------

This patch set introduces Resilient Queued Spin Lock (or rqspinlock with res_spin_lock() and res_spin_unlock() APIs).

This is a qspinlock variant which recovers the kernel from a stalled state when the lock acquisition path cannot make forward progress. This can occur when a lock acquisition attempt enters a deadlock situation (e.g. AA, or ABBA), or more generally, when the owner of the lock (which we’re trying to acquire) isn’t making forward progress.

The cover letter provides an overview of the motivation, design, and alternative approaches. We then provide evaluation numbers showcasing that while rqspinlock incurs overhead, its performance approaches that of the normal qspinlock used by the kernel. The evaluations for rqspinlock were performed by replacing the default qspinlock implementation with it and booting the kernel to run the experiments. Support for locktorture is also included, with numbers, in this series.

The cover letter's design section provides an overview of the algorithmic approach. A technical document describing the implementation in more detail is available here: https://github.com/kkdwivedi/rqspinlock/blob/main/rqspinlock.pdf

We have a WIP TLA+ proof for liveness and mutual exclusion of rqspinlock built on top of the qspinlock TLA+ proof from Catalin Marinas [3]. We will share more details and the links in the near future.
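For orientation, here is a minimal caller-side sketch of the API named above. The structure and function names (my_bucket, my_bucket_update) are made up for illustration; raw_res_spin_lock_irqsave()/raw_res_spin_unlock_irqrestore() and the error returns are the wrappers added later in this series. The key difference from a regular spin lock is that acquisition can fail and the return value must be checked:

  #include <asm/rqspinlock.h>

  struct my_bucket {
          rqspinlock_t lock;
          /* ... data protected by @lock ... */
  };

  static int my_bucket_update(struct my_bucket *b)
  {
          unsigned long flags;
          int ret;

          /* May fail with -EDEADLK (AA/ABBA detected) or -ETIMEDOUT (stall). */
          ret = raw_res_spin_lock_irqsave(&b->lock, flags);
          if (ret)
                  return ret;

          /* ... critical section ... */

          raw_res_spin_unlock_irqrestore(&b->lock, flags);
          return 0;
  }

This mirrors the shape of the BPF hashtab bucket-locking conversion included later in the series.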
Motivation
----------

In regular kernel code, the usage of locks is assumed to be correct, so as to avoid deadlocks and stalls by construction; however, the same is not true for BPF programs. Users write normal C code and the in-kernel eBPF runtime ensures the safety of the kernel by rejecting unsafe programs. Users can upload programs that use locks in an improper fashion, and may cause deadlocks when these programs run inside the kernel. The verifier is responsible for rejecting such programs from being loaded into the kernel.

Until now, the eBPF verifier ensured deadlock safety by only permitting one lock acquisition at a time, and by preventing any functions from being called from within the critical section. Additionally, only a few restricted program types are allowed to call spin locks. As the usage of eBPF grows (e.g. with sched_ext) beyond its conventional application in networking, tracing, and security, the limitations on locking are becoming a bottleneck for users.

The rqspinlock implementation allows us to permit more flexible locking patterns in BPF programs, without limiting them to the subset that can be proven safe statically (which is fairly small, and requires complex static analysis), while ensuring that the kernel will recover in case we encounter a locking violation at runtime. We make a tradeoff here by accepting programs that may potentially have deadlocks, and recover the kernel quickly at runtime to ensure availability.

Additionally, eBPF programs attached to different parts of the kernel can introduce new control flow into the kernel, which increases the likelihood of deadlocks in code not written to handle reentrancy. There have been multiple syzbot reports surfacing deadlocks in internal kernel code due to the diverse ways in which eBPF programs can be attached to different parts of the kernel. By switching the BPF subsystem's lock usage to rqspinlock, all of these issues can be mitigated at runtime.

This spin lock implementation allows BPF maps to become safer and removes mechanisms that have fallen short in assuring safety when nesting programs in arbitrary ways in the same context or across different contexts. The red diffs due to patches 16-18 demonstrate this simplification.

> kernel/bpf/hashtab.c         | 102 ++++++++++++++++++++++++++++++++--------------------------...
> kernel/bpf/lpm_trie.c        |  25 ++++++++++++++-----------
> kernel/bpf/percpu_freelist.c | 113 +++++++++++++++++++++++++---------------------------------...
> kernel/bpf/percpu_freelist.h |   4 ++--
> 4 files changed, 73 insertions(+), 171 deletions(-)

Design
------

Deadlocks mostly manifest as stalls in the waiting loops of the qspinlock slow path. Thus, using stalls as a signal for deadlocks avoids introducing cost to the normal fast path, and ensures bounded termination of the waiting loop. Our recovery algorithm is focused on terminating the waiting loops of the qspinlock algorithm when it gets stuck, and implementing bespoke recovery procedures for each class of waiter to restore the lock to a usable state.

Deadlock detection is the main mechanism used to provide faster recovery, with the timeout mechanism acting as a final line of defense.

Deadlock Detection
~~~~~~~~~~~~~~~~~~

We handle two cases of deadlocks: AA deadlocks (attempts to acquire the same lock again), and ABBA deadlocks (attempts to acquire two locks in the opposite order from two distinct threads). Variants of ABBA deadlocks may be encountered with more than two locks being held in the incorrect order. These are not diagnosed explicitly, as they reduce to ABBA deadlocks.
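To make the two classes concrete, a minimal sketch from the caller's point of view (the lock names and calling context are made up; the -EDEADLK return value is the one documented for the lock acquisition API in this series):

  static rqspinlock_t lock_a;	/* illustrative lock */

  static void aa_example(void)
  {
          if (raw_res_spin_lock(&lock_a))
                  return;
          /*
           * AA: a second acquisition of lock_a in the same context is
           * diagnosed and fails fast with -EDEADLK instead of hanging.
           */
          if (!raw_res_spin_lock(&lock_a))
                  raw_res_spin_unlock(&lock_a);
          raw_res_spin_unlock(&lock_a);
  }

  /*
   * ABBA: CPU 0 holds lock_a and attempts a second lock (say lock_b) while
   * CPU 1 holds lock_b and attempts lock_a; one of the two attempts returns
   * -EDEADLK so the other side can make progress once the failing
   * acquisition unwinds.
   */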
Deadlock detection is triggered immediately when beginning the waiting loop of a lock's slow path.

While timeouts ensure that any waiting loops in the locking slow path terminate and return to the caller, the wait can be excessively long in some situations. While the default timeout is short (0.5s), a stall for this duration inside the kernel can set off alerts for latency-critical services with strict SLOs. Ideally, the kernel should recover from an undesired state of the lock as soon as possible.

A multi-step strategy is used to recover the kernel from waiting loops in the locking algorithm which may fail to terminate in a bounded amount of time.

 * Each CPU maintains a table of held locks. Entries are inserted and removed upon entry into lock, and exit from unlock, respectively.
 * Deadlock detection for AA locks is thus simple: we have an AA deadlock if we find a held lock entry for the lock we're attempting to acquire on the same CPU.
 * During deadlock detection for ABBA, we search through the tables of all other CPUs to find situations where we are holding a lock the remote CPU is attempting to acquire, and they are holding a lock we are attempting to acquire. Upon encountering such a condition, we report an ABBA deadlock.
 * We divide the duration between the entry time point into the waiting loop and the timeout time point into intervals of 1 ms, and perform deadlock detection until the timeout happens. Upon entry into the slow path, and then upon completion of each 1 ms interval, we perform detection of both AA and ABBA deadlocks. In the event that deadlock detection yields a positive result, the recovery happens sooner than the timeout. Otherwise, it happens as a last resort upon completion of the timeout.

Timeouts
~~~~~~~~

Timeouts act as the final line of defense against stalls for waiting loops. The ktime_get_mono_fast_ns() function is used to poll for the current time, which is compared to the timestamp indicating the end time in the waiter loop. Each waiting loop is instrumented to check an extra condition using a macro. Internally, the macro implementation amortizes the checking of the timeout to avoid sampling the clock in every iteration. Precisely, the timeout checks are invoked every 64k iterations.
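A rough sketch of that amortization (the names and bookkeeping here are illustrative rather than the series' actual macro; the real check also drives the deadlock detection described above at 1 ms granularity):

  #include <linux/errno.h>
  #include <linux/timekeeping.h>
  #include <linux/types.h>

  struct spin_timeout {			/* illustrative bookkeeping */
          u64 deadline_ns;		/* entry time plus timeout */
          u16 spin;			/* iterations since the last clock read */
  };

  /* Returns 0 to keep spinning, -ETIMEDOUT once the deadline has passed. */
  static __always_inline int check_timeout_amortized(struct spin_timeout *ts)
  {
          /* Common path: no clock read, just an increment and a branch. */
          if (likely(++ts->spin & 0xffff))
                  return 0;
          /* The clock is sampled only once every 64k iterations. */
          return ktime_get_mono_fast_ns() > ts->deadline_ns ? -ETIMEDOUT : 0;
  }

A waiting loop is then instrumented roughly as: while (!cond) { if (check_timeout_amortized(&ts)) break; }.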
Recovery
~~~~~~~~

There is extensive literature in academia on designing locks that support timeouts [0][1], as timeouts can be used as a proxy for detecting the presence of deadlocks and recovering from them, without maintaining explicit metadata to construct a waits-for relationship between two threads at runtime.

In case of rqspinlock, the key simplification in our algorithm comes from the fact that upon a timeout, waiters always leave the queue in FIFO order. As such, the timeout is only enforced by the head of the wait queue, while other waiters rely on the head to signal them when a timeout has occurred and when they need to exit. We don't have to implement complex algorithms and do not need extra synchronization for waiters in the middle of the queue timing out before their predecessor or successor, unlike previous approaches [0][1].

There are three forms of waiters in the original queued spin lock algorithm. The first is the waiter which acquires the pending bit and spins on the lock word without forming a wait queue. The second is the head waiter, i.e. the first waiter heading the wait queue. The third form is all the non-head waiters queued behind the head, waiting to be signalled through their MCS node to take over the responsibility of the head.

In rqspinlock's recovery algorithm, we are concerned with the second and third kind. First, we augment the waiting loop of the head of the wait queue with a timeout. When this timeout happens, all waiters part of the wait queue will abort their lock acquisition attempts. This happens in three steps.

 * First, the head breaks out of its loop waiting for pending and locked bits to turn to 0, and non-head waiters break out of their MCS node spin (more on that later).
 * Next, every waiter (head or non-head) attempts to check whether they are also the tail waiter, in which case they attempt to zero out the tail word and allow a new queue to be built up for this lock. If they succeed, they have no one to signal next in the queue to stop spinning.
 * Otherwise, they signal the MCS node of the next waiter to break out of its spin and try resetting the tail word back to 0. This goes on until the tail waiter is found. In case of races, the new tail will be responsible for performing the same task, as the old tail will then fail to reset the tail word and wait for its next pointer to be updated before it signals the new tail to do the same.

Timeout Bound
~~~~~~~~~~~~~

The timeout is applied by two types of waiters: the pending bit waiter and the wait queue head waiter. As such, for the pending waiter, only the lock owner is ahead of it, and for the wait queue head waiter, only the lock owner and the pending waiter take precedence in executing their critical sections. We define the timeout value to span at most 1 critical section length, and then use the appropriate value (default, or default x 2) depending on whether we are the pending waiter or the head of the wait queue. Therefore, the waiting loop can span at most 2 critical section lengths, and thus, it is unaffected by the amount of contention or the number of CPUs on the host. Non-head waiters simply wait for the wait queue head to signal them on a timeout.

In Meta's production, we have noticed uncore PMU reads and SMIs consuming tens of msecs. While these events are rare, a 0.25 second timeout should absorb such tail events and not raise false alarms for timeouts. We will continue monitoring this in production and adjust the timeout if necessary in the future.

More details of the recovery algorithm are described in patch 9, and a detailed description is available at [2].

Alternatives
------------

Lockdep: We do not rely on the lockdep facility for reporting violations for primarily two reasons:

 * Overhead: The lockdep infrastructure can add significant overhead to the lock acquisition path, and is not recommended for use in production due to this reason. While the report is more useful and exhaustive, the overhead can be prohibitive, especially as BPF programs run in hot paths of the kernel. Moreover, it also increases the size of the lock word to store extra metadata, which is not feasible for BPF spin locks that are 4 bytes in size today (similar to qspinlock).

 * Debug Tool: Lockdep is intended to be used as a debugging facility, providing extra context to the user about the locking violations occurring during runtime. It is always turned off on all production kernels, therefore isn't available most of the time. We require a mechanism for detecting common variants of deadlocks that is always available in production kernels and never turned off.
At the same time, it must not introduce overhead in terms of time (for the slow path) and memory (for the lock word size). Evaluation ---------- We run benchmarks that stress locking scalability and perform comparison against the baseline (qspinlock). For the rqspinlock case, we replace the default qspinlock with it in the kernel, such that all spin locks in the kernel use the rqspinlock slow path. As such, benchmarks that stress kernel spin locks end up exercising rqspinlock. Evaluation setup ~~~~~~~~~~~~~~~~ We set the CPU governor to performance for all experiments. Note: Numbers for arm64 have been obtained without the no-WFE fallback in this series, to perform a fair comparison with the WFE using qspinlock baseline. x86_64: Intel Xeon Platinum 8468 (Sapphire Rapids) 96 cores (48 x 2 sockets) 2 threads per core, 0-95, siblings from 96-191 2 NUMA nodes (every 48 cores), 2 LLCs (every 48 cores), 1 LLC per NUMA node Hyperthreading enabled arm64: Ampere Max Neoverse-N1 256-Core Processor 256 cores (128 cores x 2 sockets) 1 thread per core 2 NUMA nodes (every 128 cores), 1 L2 per core (256 instances), no shared L3 No hyperthreading available The locktorture experiment is run for 30 seconds. Average of 25 runs is used for will-it-scale, after an initial warm up. More information on the locks contended in the will-it-scale experiments is available in the evaluation section of the CNA paper, in table 1 [4]. Legend: QL - qspinlock (avg. throughput) RQL - rqspinlock (avg. throughput) Results ~~~~~~~ locktorture - x86_64 Threads QL RQL Speedup ----------------------------------------------- 1 46910437 45057327 0.96 2 29871063 25085034 0.84 413876024
19242776 1.39 8 14638499 13346847 0.91 16 14380506 14104716 0.98 24 17278144 15293077 0.89 32 19494283 17826675 0.91 40 27760955 21002910 0.76 48 28638897 26432549 0.92 56 29336194 26512029 0.9 64 30040731 27421403 0.91 72 29523599 27010618 0.91 80 28846738 27885141 0.97 88 29277418 25963753 0.89 96 28472339 27423865 0.96 104 28093317 26634895 0.95 112 29914000 27872339 0.93 120 29199580 26682695 0.91 128 27755880 27314662 0.98 136 30349095 27092211 0.89 144 29193933 27805445 0.95 152 28956663 26071497 0.9 160 28950009 28183864 0.97 168 29383520 28135091 0.96 176 28475883 27549601 0.97 184 31958138 28602434 0.89 192 31342633 33394385 1.07 will-it-scale open1_threads - x86_64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 1396323.92 7373.12 0.53 1366616.8 4152.08 0.3 0.98 2 1844403.8 3165.26 0.17 1700301.96 2396.58 0.14 0.92 4 2370590.6 24545.54 1.04 1655872.32 47938.71 2.9 0.7 8 2185227.04 9537.9 0.44 1691205.16 9783.25 0.58 0.77 16 2110672.36 10972.99 0.52 1781696.24 15021.43 0.84 0.84 24 1655042.72 18037.23 1.09 2165125.4 5422.54 0.25 1.31 32 1738928.24 7166.64 0.41 1829468.24 9081.59 0.5 1.05 40 1854430.52 6148.24 0.33 1731062.28 3311.95 0.19 0.93 48 1766529.96 5063.86 0.29 1749375.28 2311.27 0.13 0.99 56 1303016.28 6168.4 0.47 1452656 7695.29 0.53 1.11 64 1169557.96 4353.67 0.37 1287370.56 8477.2 0.66 1.1 72 1036023.4 7116.53 0.69 1135513.92 9542.55 0.84 1.1 80 1097913.64 11356 1.03 1176864.8 6771.41 0.58 1.07 88 1123907.36 12843.13 1.14 1072416.48 7412.25 0.69 0.95 96 1166981.52 9402.71 0.81 1129678.76 9499.14 0.84 0.97 104 1108954.04 8171.46 0.74 1032044.44 7840.17 0.76 0.93 112 1000777.76 8445.7 0.84 1078498.8 6551.47 0.61 1.08 120 1029448.4 6992.29 0.68 1093743 8378.94 0.77 1.06 128 1106670.36 10102.15 0.91 1241438.68 23212.66 1.87 1.12 136 1183776.88 6394.79 0.54 1116799.64 18111.38 1.62 0.94 144 1201122 25917.69 2.16 1301779.96 15792.6 1.21 1.08 152 1099737.08 13567.82 1.23 1053647.2 12704.29 1.21 0.96 160 1031186.32 9048.07 0.88 1069961.4 8293.18 0.78 1.04 168 1068817 16486.06 1.54 1096495.36 14021.93 1.28 1.03 176 966633.96 9623.27 1 1081129.84 9474.81 0.88 1.12 184 1004419.04 12111.11 1.21 1037771.24 12001.66 1.16 1.03 192 1088858.08 16522.93 1.52 1027943.12 14238.57 1.39 0.94 will-it-scale open2_threads - x86_64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 1337797.76 4649.19 0.35 1332609.4 3813.14 0.29 1 2 1598300.2 1059.93 0.07 1771891.36 5667.12 0.32 1.11 4 1736573.76 13025.33 0.75 1396901.2 2682.46 0.19 0.8 8 1794367.84 4879.6 0.27 1917478.56 3751.98 0.2 1.07 16 1990998.44 8332.78 0.42 1864165.56 9648.59 0.52 0.94 24 1868148.56 4248.23 0.23 1710136.68 2760.58 0.16 0.92 32 1955180 6719 0.34 1936149.88 1980.87 0.1 0.99 40 1769646.4 4686.54 0.26 1729653.68 4551.22 0.26 0.98 48 1724861.16 4056.66 0.24 1764900 971.11 0.06 1.02 56 1318568 7758.86 0.59 1385660.84 7039.8 0.51 1.05 64 1143290.28 5351.43 0.47 1316686.6 5597.69 0.43 1.15 72 1196762.68 10655.67 0.89 1230173.24 9858.2 0.8 1.03 80 1126308.24 6901.55 0.61 1085391.16 7444.34 0.69 0.96 88 1035672.96 5452.95 0.53 1035541.52 8095.33 0.78 1 96 1030203.36 6735.71 0.65 1020113.48 8683.13 0.85 0.99 104 1039432.88 6583.59 0.63 1083902.48 5775.72 0.53 1.04 112 1113609.04 4380.62 0.39 1072010.36 8983.14 0.84 0.96 120 1109420.96 7183.5 0.65 1079424.12 10929.97 1.01 0.97 128 1095400.04 4274.6 0.39 1095475.2 
12042.02 1.1 1 136 1071605.4 11103.73 1.04 1114757.2 10516.55 0.94 1.04 144 1104147.2 9714.75 0.88 1044954.16 7544.2 0.72 0.95 152 1164280.24 13386.15 1.15 1101213.92 11568.49 1.05 0.95 160 1084892.04 7941.25 0.73 1152273.76 9593.38 0.83 1.06 168 983654.76 11772.85 1.2 1111772.28 9806.83 0.88 1.13 176 1087544.24 11262.35 1.04 1077507.76 9442.02 0.88 0.99 184 1101682.4 24701.68 2.24 1095223.2 16707.29 1.53 0.99 192 983712.08 13453.59 1.37 1051244.2 15662.05 1.49 1.07 will-it-scale lock1_threads - x86_64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 4307484.96 3959.31 0.09 4252908.56 10375.78 0.24 0.99 2 7701844.32 4169.88 0.05 7219233.52 6437.11 0.09 0.94 4 14781878.72 22854.85 0.15 15260565.12 37305.71 0.24 1.03 8 12949698.64 99270.42 0.77 9954660.4 142805.68 1.43 0.77 16 12947690.64 72977.27 0.56 10865245.12 49520.31 0.46 0.84 24 11142990.64 33200.39 0.3 11444391.68 37884.46 0.33 1.03 32 9652335.84 22369.48 0.23 9344086.72 21639.22 0.23 0.97 40 9185931.12 5508.96 0.06 8881506.32 5072.33 0.06 0.97 48 9084385.36 10871.05 0.12 8863579.12 4583.37 0.05 0.98 56 6595540.96 33100.59 0.5 6640389.76 46619.96 0.7 1.01 64 5946726.24 47160.5 0.79 6572155.84 91973.73 1.4 1.11 72 6744894.72 43166.65 0.64 5991363.36 80637.56 1.35 0.89 80 6234502.16 118983.16 1.91 5157894.32 73592.72 1.43 0.83 88 5053879.6 199713.75 3.95 4479758.08 36202.27 0.81 0.89 96 5184302.64 99199.89 1.91 5249210.16 122348.69 2.33 1.01 104 4612391.92 40803.05 0.88 4850209.6 26813.28 0.55 1.05 112 4809209.68 24070.68 0.5 4869477.84 27489.04 0.56 1.01 120 5130746.4 34265.5 0.67 4620047.12 44229.54 0.96 0.9 128 5376465.28 95028.05 1.77 4781179.6 43700.93 0.91 0.89 136 5453742.4 86718.87 1.59 5412457.12 40339.68 0.75 0.99 144 5805040.72 84669.31 1.46 5595382.48 68701.65 1.23 0.96 152 5842897.36 31120.33 0.53 5787587.12 43521.68 0.75 0.99 160 5837665.12 14179.44 0.24 5118808.72 45193.23 0.88 0.88 168 5660332.72 27467.09 0.49 5104959.04 40891.75 0.8 0.9 176 5180312.24 28656.39 0.55 4718407.6 58734.13 1.24 0.91 184 4706824.16 50469.31 1.07 4692962.64 92266.85 1.97 1 192 5126054.56 51082.02 1 4680866.8 58743.51 1.25 0.91 will-it-scale lock2_threads - x86_64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 4316091.2 4933.28 0.11 4293104 30369.71 0.71 0.99 2 3500046.4 19852.62 0.57 4507627.76 23667.66 0.53 1.29 4 3639098.96 26370.65 0.72 3673166.32 30822.71 0.84 1.01 8 3714548.56 49953.44 1.34 4055818.56 71630.41 1.77 1.09 16 4188724.64 105414.49 2.52 4316077.12 68956.15 1.6 1.03 24 3737908.32 47391.46 1.27 3762254.56 55345.7 1.47 1.01 32 3820952.8 45207.66 1.18 3710368.96 52651.92 1.42 0.97 40 3791280.8 28630.55 0.76 3661933.52 37671.27 1.03 0.97 48 3765721.84 59553.83 1.58 3604738.64 50861.36 1.41 0.96 56 3175505.76 64336.17 2.03 2771022.48 66586.99 2.4 0.87 64 2620294.48 71651.34 2.73 2650171.68 44810.83 1.69 1.01 72 2861893.6 86542.61 3.02 2537437.2 84571.75 3.33 0.89 80 2976297.2 83566.43 2.81 2645132.8 85992.34 3.25 0.89 88 2547724.8 102014.36 4 2336852.16 80570.25 3.45 0.92 96 2945310.32 82673.25 2.81 2513316.96 45741.81 1.82 0.85 104 3028818.64 90643.36 2.99 2581787.52 52967.48 2.05 0.85 112 2546264.16 102605.82 4.03 2118812.64 62043.19 2.93 0.83 120 2917334.64 112220.01 3.85 2720418.64 64035.96 2.35 0.93 128 2906621.84 69428.1 2.39 2795310.32 56736.87 2.03 0.96 136 2841833.76 105541.11 3.71 3063404.48 62288.94 2.03 
1.08 144 3032822.32 134796.56 4.44 3169985.6 149707.83 4.72 1.05 152 2557694.96 62218.15 2.43 2469887.6 68343.78 2.77 0.97 160 2810214.72 61468.79 2.19 2323768.48 54226.71 2.33 0.83 168 2651146.48 76573.27 2.89 2385936.64 52433.98 2.2 0.9 176 2720616.32 89026.19 3.27 2941400.08 59296.64 2.02 1.08 184 2696086 88541.24 3.28 2598225.2 76365.7 2.94 0.96 192 2908194.48 87023.91 2.99 2377677.68 53299.82 2.24 0.82 locktorture - arm64 Threads QL RQL Speedup ----------------------------------------------- 1 43320464 44718174 1.03 2 21056971 29255448 1.39 4 16040120 11563981 0.72 8 12786398 12838909 1 16 13646408 13436730 0.98 24 13597928 13669457 1.01 32 16456220 14600324 0.89 40 16667726 13883101 0.83 48 14347691 14608641 1.02 56 15624580 15180758 0.97 64 18105114 16009137 0.88 72 16606438 14772256 0.89 80 16550202 14124056 0.85 88 16716082 15930618 0.95 96 16489242 16817657 1.02 104 17915808 17165324 0.96 112 17217482 21343282 1.24 120 20449845 20576123 1.01 128 18700902 20286275 1.08 136 17913378 21142921 1.18 144 18225673 18971921 1.04 152 18374206 19229854 1.05 160 23136514 20129504 0.87 168 21096269 17167777 0.81 176 21376794 21594914 1.01 184 23542989 20638298 0.88 192 22793754 20655980 0.91 200 20933027 19628316 0.94 208 23105684 25572720 1.11 216 24158081 23173848 0.96 224 23388984 22485353 0.96 232 21916401 23899343 1.09 240 22292129 22831784 1.02 248 25812762 22636787 0.88 256 24294738 26127113 1.08 will-it-scale open1_threads - arm64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 844452.32 801 0.09 804936.92 900.25 0.11 0.95 2 1309419.08 9495.78 0.73 1265080.24 3171.13 0.25 0.97 4 2113074.24 5363.19 0.25 2041158.28 7883.65 0.39 0.97 8 1916650.96 15749.86 0.82 2039850.04 7562.87 0.37 1.06 16 1835540.72 12940.45 0.7 1937398.56 11461.15 0.59 1.06 24 1876760.48 12581.67 0.67 1966659.16 10012.69 0.51 1.05 32 1834525.6 5571.08 0.3 1929180.4 6221.96 0.32 1.05 40 1851592.76 7848.18 0.42 1937504.44 5991.55 0.31 1.05 48 1845067 4118.68 0.22 1773331.56 6068.23 0.34 0.96 56 1742709.36 6874.03 0.39 1716184.92 6713.16 0.39 0.98 64 1685339.72 6688.91 0.4 1676046.16 5844.06 0.35 0.99 72 1694838.84 2433.41 0.14 1821189.6 2906.89 0.16 1.07 80 1738778.68 2916.74 0.17 1729212.6 3714.41 0.21 0.99 88 1753131.76 2734.34 0.16 1713294.32 4652.82 0.27 0.98 96 1694112.52 4449.69 0.26 1714438.36 5621.66 0.33 1.01 104 1780279.76 2420.52 0.14 1767679.12 3067.66 0.17 0.99 112 1700284.72 9796.23 0.58 1796674.6 4066.06 0.23 1.06 120 1760466.72 3978.65 0.23 1704706.08 4080.04 0.24 0.97 128 1634067.96 5187.94 0.32 1764115.48 3545.02 0.2 1.08 136 1170303.84 7602.29 0.65 1227188.04 8090.84 0.66 1.05 144 953186.16 7859.02 0.82 964822.08 10536.61 1.09 1.01 152 818893.96 7238.86 0.88 853412.44 5932.25 0.7 1.04 160 707460.48 3868.26 0.55 746985.68 10363.03 1.39 1.06 168 658380.56 4938.77 0.75 672101.12 5442.95 0.81 1.02 176 614692.04 3137.74 0.51 615143.36 6197.19 1.01 1 184 574808.88 4741.61 0.82 592395.08 8840.92 1.49 1.03 192 548142.92 6116.31 1.12 571299.68 8388.56 1.47 1.04 200 511621.96 2182.33 0.43 532144.88 5467.04 1.03 1.04 208 506583.32 6834.39 1.35 521427.08 10318.65 1.98 1.03 216 480438.04 3608.96 0.75 510697.76 8086.47 1.58 1.06 224 470644.96 3451.35 0.73 467433.92 5008.59 1.07 0.99 232 466973.72 6599.97 1.41 444345.92 2144.96 0.48 0.95 240 442927.68 2351.56 0.53 440503.56 4289.01 0.97 0.99 248 432991.16 5829.92 1.35 445462.6 5944.03 1.33 1.03 256 409455.44 1430.5 0.35 422219.4 4007.04 0.95 1.03 
will-it-scale open2_threads - arm64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 818645.4 1097.02 0.13 774110.24 1562.45 0.2 0.95 2 1281013.04 2188.78 0.17 1238346.24 2149.97 0.17 0.97 4 2058514.16 13105.36 0.64 1985375 3204.48 0.16 0.96 8 1920414.8 16154.63 0.84 1911667.92 8882.98 0.46 1 16 1943729.68 8714.38 0.45 1978946.72 7465.65 0.38 1.02 24 1915846.88 7749.9 0.4 1914442.72 9841.71 0.51 1 32 1964695.92 8854.83 0.45 1914650.28 9357.82 0.49 0.97 40 1845071.12 5103.26 0.28 1891685.44 4278.34 0.23 1.03 48 1838897.6 5123.61 0.28 1843498.2 5391.94 0.29 1 56 1823768.32 3214.14 0.18 1736477.48 5675.49 0.33 0.95 64 1627162.36 3528.1 0.22 1685727.16 6102.63 0.36 1.04 72 1725320.16 4709.83 0.27 1710174.4 6707.54 0.39 0.99 80 1692288.44 9110.89 0.54 1773676.24 4327.94 0.24 1.05 88 1725496.64 4249.71 0.25 1695173.84 5097.14 0.3 0.98 96 1766093.08 2280.09 0.13 1732782.64 3606.1 0.21 0.98 104 1647753 2926.83 0.18 1710876.4 4416.04 0.26 1.04 112 1763785.52 3838.26 0.22 1803813.76 1859.2 0.1 1.02 120 1684095.16 2385.31 0.14 1766903.08 3258.34 0.18 1.05 128 1733528.56 2800.62 0.16 1677446.32 3201.14 0.19 0.97 136 1179187.84 6804.86 0.58 1241839.52 10698.51 0.86 1.05 144 969456.36 6421.85 0.66 1018441.96 8732.19 0.86 1.05 152 839295.64 10422.66 1.24 817531.92 6778.37 0.83 0.97 160 743010.72 6957.98 0.94 749291.16 9388.47 1.25 1.01 168 666049.88 13159.73 1.98 689408.08 10192.66 1.48 1.04 176 609185.56 5685.18 0.93 653744.24 10847.35 1.66 1.07 184 602232.08 12089.72 2.01 597718.6 13856.45 2.32 0.99 192 563919.32 9870.46 1.75 560080.4 8388.47 1.5 0.99 200 522396.28 4155.61 0.8 539168.64 10456.64 1.94 1.03 208 520328.28 9353.14 1.8 510011.4 6061.19 1.19 0.98 216 479797.72 5824.58 1.21 486955.32 4547.05 0.93 1.01 224 467943.8 4484.86 0.96 473252.76 5608.58 1.19 1.01 232 456914.24 3129.5 0.68 457463.2 7474.83 1.63 1 240 450535 5149.78 1.14 437653.56 4604.92 1.05 0.97 248 435475.2 2350.87 0.54 435589.24 6176.01 1.42 1 256 416737.88 2592.76 0.62 424178.28 3932.2 0.93 1.02 will-it-scale lock1_threads - arm64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 2512077.52 3026.1 0.12 2085365.92 1612.44 0.08 0.83 2 4840180.4 3646.31 0.08 4326922.24 3802.17 0.09 0.89 4 9358779.44 6673.07 0.07 8467588.56 5577.05 0.07 0.9 8 9374436.88 18826.26 0.2 8635110.16 4217.66 0.05 0.92 16 9527184.08 14111.94 0.15 8561174.16 3258.6 0.04 0.9 24 8873099.76 17242.32 0.19 9286778.72 4124.51 0.04 1.05 32 8457640.4 10790.92 0.13 8700401.52 5110 0.06 1.03 40 8478771.76 13250.8 0.16 8746198.16 7606.42 0.09 1.03 48 8329097.76 7958.92 0.1 8774265.36 6082.08 0.07 1.05 56 8330143.04 11586.93 0.14 8472426.48 7402.13 0.09 1.02 64 8334684.08 10478.03 0.13 7979193.52 8436.63 0.11 0.96 72 7941815.52 16031.38 0.2 8016885.52 12640.56 0.16 1.01 80 8042221.68 10219.93 0.13 8072222.88 12479.54 0.15 1 88 8190336.8 10751.38 0.13 8432977.6 11865.67 0.14 1.03 96 8235010.08 7267.8 0.09 8022101.28 11910.63 0.15 0.97 104 8154434.08 7770.8 0.17987812
7647.42 0.1 0.98 112 7738464.56 11067.72 0.14 7968483.92 20632.93 0.26 1.03 120 8228919.36 10395.79 0.13 8304329.28 11913.76 0.14 1.01 128 7798646.64 8877.8 0.11 8197938.4 7527.81 0.09 1.05 136 5567293.68 66259.82 1.19 5642017.12 126584.59 2.24 1.01 144 4425655.52 55729.96 1.26 4519874.64 82996.01 1.84 1.02 152 3871300.8 77793.78 2.01 3850025.04 80167.3 2.08 0.99 160 3558041.68 55108.3 1.55 3495924.96 83626.42 2.39 0.98 168 3302042.72 45011.89 1.36 3298002.8 59393.64 1.8 1 176 3066165.2 34896.54 1.14 3063027.44 58219.26 1.9 1 184 2817899.6 43585.27 1.55 2859393.84 45258.03 1.58 1.01 192 2690403.76 42236.77 1.57 2630652.24 35953.13 1.37 0.98 200 2563141.44 28145.43 1.1 2539964.32 38556.52 1.52 0.99 208 2502968.8 27687.81 1.11 2477757.28 28240.81 1.14 0.99 216 2474917.76 24128.71 0.97 2483161.44 32198.37 1.3 1 224 2386874.72 32954.66 1.38 2398068.48 37667.29 1.57 1 232 2379248.24 27413.4 1.15 2327601.68 24565.28 1.06 0.98 240 2302146.64 19914.19 0.87 2236074.64 20968.17 0.94 0.97 248 2241798.32 21542.52 0.96 2173312.24 26498.36 1.22 0.97 256 2198765.12 20832.66 0.95 2136159.52 25027.96 1.17 0.97 will-it-scale lock2_threads - arm64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 2499414.32 1932.27 0.08 2075704.8 24589.71 1.18 0.83 2 3887820 34198.36 0.88 4057432.64 11896.04 0.29 1.04 4 3445307.6 7958.3 0.23 3869960.4 3788.5 0.1 1.12 8 4310597.2 14405.9 0.33 3931319.76 5845.33 0.15 0.91 16 3995159.84 22621.85 0.57 3953339.68 15668.9 0.4 0.99 24 4048456.88 22956.51 0.57 3887812.64 30584.77 0.79 0.96 32 3974808.64 20465.87 0.51 3718778.08 27407.24 0.74 0.94 40 3941154.88 15136.68 0.38 3551464.24 33378.67 0.94 0.9 48 3725436.32 17090.67 0.46 3714356.08 19035.26 0.51 1 56 3558449.44 10123.46 0.28 3449656.08 36476.87 1.06 0.97 64 3514616.08 16470.99 0.47 3493197.04 25639.82 0.73 0.99 72 3461700.88 16780.97 0.48 3376565.04 16930.19 0.5 0.98 80 3797008.64 17599.05 0.46 3505856.16 34320.34 0.98 0.92 88 3737459.44 10774.93 0.29 3631757.68 24231.29 0.67 0.97 96 3612816.16 21865.86 0.61 3545354.56 16391.15 0.46 0.98 104 3765167.36 17763.8 0.47 3466467.12 22235.45 0.64 0.92 112 3713386 15455.21 0.42 3402210 18349.66 0.54 0.92 120 3699986.08 15153.08 0.41 3580303.92 19823.01 0.55 0.97 128 3648694.56 11891.62 0.33 3426445.28 22993.32 0.67 0.94 136 800046.88 6039.73 0.75 784412.16 9062.03 1.16 0.98 144 769483.36 5231.74 0.68 714132.8 8953.57 1.25 0.93 152 821081.52 4249.12 0.52 743694.64 8155.18 1.1 0.91 160 789040.16 9187.4 1.16 834865.44 6159.29 0.74 1.06 168 867742.4 8967.66 1.03 734905.36 15582.75 2.12 0.85 176 838650.32 7949.72 0.95 846939.68 8959.8 1.06 1.01 184 854984.48 19475.51 2.28 794549.92 11924.54 1.5 0.93 192 846262.32 13795.86 1.63 899915.12 8639.82 0.96 1.06 200 942602.16 12665.42 1.34 900385.76 8592.23 0.95 0.96 208 954183.68 12853.22 1.35 1166186.96 13045.03 1.12 1.22 216 929319.76 10157.79 1.09 926773.76 10577.01 1.14 1 224 967896.56 9819.6 1.01 951144.32 12343.83 1.3 0.98 232 990621.12 7771.97 0.78 916361.2 17878.44 1.95 0.93 240 995285.04 20104.22 2.02 972119.6 12856.42 1.32 0.98 2481029436
20404.97 1.98 965301.28 11102.95 1.15 0.94 256 1038724.8 19201.03 1.85 1029942.08 12563.07 1.22 0.99 Written By ---------- Alexei Starovoitov <ast@kernel.org> Kumar Kartikeya Dwivedi <memxor@gmail.com> [0]: https://www.cs.rochester.edu/research/synchronization/pseudocode/timeout.html [1]: https://dl.acm.org/doi/10.1145/571825.571830 [2]: https://github.com/kkdwivedi/rqspinlock/blob/main/rqspinlock.pdf [3]: https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/plain/qspinlock.tla [4]: https://arxiv.org/pdf/1810.05600 ==================== Link: https://patch.msgid.link/20250316040541.108729-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
commit 6ffb9017e9
27 changed files with 2315 additions and 420 deletions
@@ -4297,6 +4297,8 @@ F: include/uapi/linux/filter.h
 F: kernel/bpf/
 F: kernel/trace/bpf_trace.c
 F: lib/buildid.c
+F: arch/*/include/asm/rqspinlock.h
+F: include/asm-generic/rqspinlock.h
 F: lib/test_bpf.c
 F: net/bpf/
 F: net/core/filter.c
arch/arm64/include/asm/rqspinlock.h (new file, 93 lines)
@@ -0,0 +1,93 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_RQSPINLOCK_H
#define _ASM_RQSPINLOCK_H

#include <asm/barrier.h>

/*
 * Hardcode res_smp_cond_load_acquire implementations for arm64 to a custom
 * version based on [0]. In rqspinlock code, our conditional expression involves
 * checking the value _and_ additionally a timeout. However, on arm64, the
 * WFE-based implementation may never spin again if no stores occur to the
 * locked byte in the lock word. As such, we may be stuck forever if
 * event-stream based unblocking is not available on the platform for WFE spin
 * loops (arch_timer_evtstrm_available).
 *
 * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
 * copy-paste.
 *
 * While we rely on the implementation to amortize the cost of sampling
 * cond_expr for us, it will not happen when event stream support is
 * unavailable, time_expr check is amortized. This is not the common case, and
 * it would be difficult to fit our logic in the time_expr_ns >= time_limit_ns
 * comparison, hence just let it be. In case of event-stream, the loop is woken
 * up at microsecond granularity.
 *
 * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
 */

#ifndef smp_cond_load_acquire_timewait

#define smp_cond_time_check_count	200

#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, \
					 time_limit_ns) ({ \
	typeof(ptr) __PTR = (ptr); \
	__unqual_scalar_typeof(*ptr) VAL; \
	unsigned int __count = 0; \
	for (;;) { \
		VAL = READ_ONCE(*__PTR); \
		if (cond_expr) \
			break; \
		cpu_relax(); \
		if (__count++ < smp_cond_time_check_count) \
			continue; \
		if ((time_expr_ns) >= (time_limit_ns)) \
			break; \
		__count = 0; \
	} \
	(typeof(*ptr))VAL; \
})

#define __smp_cond_load_acquire_timewait(ptr, cond_expr, \
					 time_expr_ns, time_limit_ns) \
({ \
	typeof(ptr) __PTR = (ptr); \
	__unqual_scalar_typeof(*ptr) VAL; \
	for (;;) { \
		VAL = smp_load_acquire(__PTR); \
		if (cond_expr) \
			break; \
		__cmpwait_relaxed(__PTR, VAL); \
		if ((time_expr_ns) >= (time_limit_ns)) \
			break; \
	} \
	(typeof(*ptr))VAL; \
})

#define smp_cond_load_acquire_timewait(ptr, cond_expr, \
				       time_expr_ns, time_limit_ns) \
({ \
	__unqual_scalar_typeof(*ptr) _val; \
	int __wfe = arch_timer_evtstrm_available(); \
 \
	if (likely(__wfe)) { \
		_val = __smp_cond_load_acquire_timewait(ptr, cond_expr, \
							time_expr_ns, \
							time_limit_ns); \
	} else { \
		_val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr, \
							time_expr_ns, \
							time_limit_ns); \
		smp_acquire__after_ctrl_dep(); \
	} \
	(typeof(*ptr))_val; \
})

#endif

#define res_smp_cond_load_acquire_timewait(v, c) smp_cond_load_acquire_timewait(v, c, 0, 1)

#include <asm-generic/rqspinlock.h>

#endif /* _ASM_RQSPINLOCK_H */
arch/x86/include/asm/rqspinlock.h (new file, 33 lines)
@@ -0,0 +1,33 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_X86_RQSPINLOCK_H
#define _ASM_X86_RQSPINLOCK_H

#include <asm/paravirt.h>

#ifdef CONFIG_PARAVIRT
DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key);

#define resilient_virt_spin_lock_enabled resilient_virt_spin_lock_enabled
static __always_inline bool resilient_virt_spin_lock_enabled(void)
{
	return static_branch_likely(&virt_spin_lock_key);
}

#ifdef CONFIG_QUEUED_SPINLOCKS
typedef struct qspinlock rqspinlock_t;
#else
typedef struct rqspinlock rqspinlock_t;
#endif
extern int resilient_tas_spin_lock(rqspinlock_t *lock);

#define resilient_virt_spin_lock resilient_virt_spin_lock
static inline int resilient_virt_spin_lock(rqspinlock_t *lock)
{
	return resilient_tas_spin_lock(lock);
}

#endif /* CONFIG_PARAVIRT */

#include <asm-generic/rqspinlock.h>

#endif /* _ASM_X86_RQSPINLOCK_H */
@@ -45,6 +45,7 @@ mandatory-y += pci.h
 mandatory-y += percpu.h
 mandatory-y += pgalloc.h
 mandatory-y += preempt.h
+mandatory-y += rqspinlock.h
 mandatory-y += runtime-const.h
 mandatory-y += rwonce.h
 mandatory-y += sections.h
@@ -1,6 +1,12 @@
 #ifndef __ASM_MCS_SPINLOCK_H
 #define __ASM_MCS_SPINLOCK_H

+struct mcs_spinlock {
+	struct mcs_spinlock *next;
+	int locked; /* 1 if lock acquired */
+	int count;  /* nesting count, see qspinlock.c */
+};
+
 /*
  * Architectures can define their own:
  *
include/asm-generic/rqspinlock.h (new file, 250 lines)
@@ -0,0 +1,250 @@
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
 * Resilient Queued Spin Lock
 *
 * (C) Copyright 2024-2025 Meta Platforms, Inc. and affiliates.
 *
 * Authors: Kumar Kartikeya Dwivedi <memxor@gmail.com>
 */
#ifndef __ASM_GENERIC_RQSPINLOCK_H
#define __ASM_GENERIC_RQSPINLOCK_H

#include <linux/types.h>
#include <vdso/time64.h>
#include <linux/percpu.h>
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm/qspinlock.h>
#endif

struct rqspinlock {
	union {
		atomic_t val;
		u32 locked;
	};
};

/* Even though this is same as struct rqspinlock, we need to emit a distinct
 * type in BTF for BPF programs.
 */
struct bpf_res_spin_lock {
	u32 val;
};

struct qspinlock;
#ifdef CONFIG_QUEUED_SPINLOCKS
typedef struct qspinlock rqspinlock_t;
#else
typedef struct rqspinlock rqspinlock_t;
#endif

extern int resilient_tas_spin_lock(rqspinlock_t *lock);
#ifdef CONFIG_QUEUED_SPINLOCKS
extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val);
#endif

#ifndef resilient_virt_spin_lock_enabled
static __always_inline bool resilient_virt_spin_lock_enabled(void)
{
	return false;
}
#endif

#ifndef resilient_virt_spin_lock
static __always_inline int resilient_virt_spin_lock(rqspinlock_t *lock)
{
	return 0;
}
#endif

/*
 * Default timeout for waiting loops is 0.25 seconds
 */
#define RES_DEF_TIMEOUT (NSEC_PER_SEC / 4)

/*
 * Choose 31 as it makes rqspinlock_held cacheline-aligned.
 */
#define RES_NR_HELD 31

struct rqspinlock_held {
	int cnt;
	void *locks[RES_NR_HELD];
};

DECLARE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);

static __always_inline void grab_held_lock_entry(void *lock)
{
	int cnt = this_cpu_inc_return(rqspinlock_held_locks.cnt);

	if (unlikely(cnt > RES_NR_HELD)) {
		/* Still keep the inc so we decrement later. */
		return;
	}

	/*
	 * Implied compiler barrier in per-CPU operations; otherwise we can have
	 * the compiler reorder inc with write to table, allowing interrupts to
	 * overwrite and erase our write to the table (as on interrupt exit it
	 * will be reset to NULL).
	 *
	 * It is fine for cnt inc to be reordered wrt remote readers though,
	 * they won't observe our entry until the cnt update is visible, that's
	 * all.
	 */
	this_cpu_write(rqspinlock_held_locks.locks[cnt - 1], lock);
}

/*
 * We simply don't support out-of-order unlocks, and keep the logic simple here.
 * The verifier prevents BPF programs from unlocking out-of-order, and the same
 * holds for in-kernel users.
 *
 * It is possible to run into misdetection scenarios of AA deadlocks on the same
 * CPU, and missed ABBA deadlocks on remote CPUs if this function pops entries
 * out of order (due to lock A, lock B, unlock A, unlock B) pattern. The correct
 * logic to preserve right entries in the table would be to walk the array of
 * held locks and swap and clear out-of-order entries, but that's too
 * complicated and we don't have a compelling use case for out of order unlocking.
 */
static __always_inline void release_held_lock_entry(void)
{
	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);

	if (unlikely(rqh->cnt > RES_NR_HELD))
		goto dec;
	WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
dec:
	/*
	 * Reordering of clearing above with inc and its write in
	 * grab_held_lock_entry that came before us (in same acquisition
	 * attempt) is ok, we either see a valid entry or NULL when it's
	 * visible.
	 *
	 * But this helper is invoked when we unwind upon failing to acquire the
	 * lock. Unlike the unlock path which constitutes a release store after
	 * we clear the entry, we need to emit a write barrier here. Otherwise,
	 * we may have a situation as follows:
	 *
	 * <error> for lock B
	 * release_held_lock_entry
	 *
	 * try_cmpxchg_acquire for lock A
	 * grab_held_lock_entry
	 *
	 * Lack of any ordering means reordering may occur such that dec, inc
	 * are done before entry is overwritten. This permits a remote lock
	 * holder of lock B (which this CPU failed to acquire) to now observe it
	 * as being attempted on this CPU, and may lead to misdetection (if this
	 * CPU holds a lock it is attempting to acquire, leading to false ABBA
	 * diagnosis).
	 *
	 * In case of unlock, we will always do a release on the lock word after
	 * releasing the entry, ensuring that other CPUs cannot hold the lock
	 * (and make conclusions about deadlocks) until the entry has been
	 * cleared on the local CPU, preventing any anomalies. Reordering is
	 * still possible there, but a remote CPU cannot observe a lock in our
	 * table which it is already holding, since visibility entails our
	 * release store for the said lock has not retired.
	 *
	 * In theory we don't have a problem if the dec and WRITE_ONCE above get
	 * reordered with each other, we either notice an empty NULL entry on
	 * top (if dec succeeds WRITE_ONCE), or a potentially stale entry which
	 * cannot be observed (if dec precedes WRITE_ONCE).
	 *
	 * Emit the write barrier _before_ the dec, this permits dec-inc
	 * reordering but that is harmless as we'd have new entry set to NULL
	 * already, i.e. they cannot precede the NULL store above.
	 */
	smp_wmb();
	this_cpu_dec(rqspinlock_held_locks.cnt);
}

#ifdef CONFIG_QUEUED_SPINLOCKS

/**
 * res_spin_lock - acquire a queued spinlock
 * @lock: Pointer to queued spinlock structure
 *
 * Return:
 * * 0		- Lock was acquired successfully.
 * * -EDEADLK	- Lock acquisition failed because of AA/ABBA deadlock.
 * * -ETIMEDOUT - Lock acquisition failed because of timeout.
 */
static __always_inline int res_spin_lock(rqspinlock_t *lock)
{
	int val = 0;

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))) {
		grab_held_lock_entry(lock);
		return 0;
	}
	return resilient_queued_spin_lock_slowpath(lock, val);
}

#else

#define res_spin_lock(lock) resilient_tas_spin_lock(lock)

#endif /* CONFIG_QUEUED_SPINLOCKS */

static __always_inline void res_spin_unlock(rqspinlock_t *lock)
{
	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);

	if (unlikely(rqh->cnt > RES_NR_HELD))
		goto unlock;
	WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
unlock:
	/*
	 * Release barrier, ensures correct ordering. See release_held_lock_entry
	 * for details. Perform release store instead of queued_spin_unlock,
	 * since we use this function for test-and-set fallback as well. When we
	 * have CONFIG_QUEUED_SPINLOCKS=n, we clear the full 4-byte lockword.
	 *
	 * Like release_held_lock_entry, we can do the release before the dec.
	 * We simply care about not seeing the 'lock' in our table from a remote
	 * CPU once the lock has been released, which doesn't rely on the dec.
	 *
	 * Unlike smp_wmb(), release is not a two way fence, hence it is
	 * possible for a inc to move up and reorder with our clearing of the
	 * entry. This isn't a problem however, as for a misdiagnosis of ABBA,
	 * the remote CPU needs to hold this lock, which won't be released until
	 * the store below is done, which would ensure the entry is overwritten
	 * to NULL, etc.
	 */
	smp_store_release(&lock->locked, 0);
	this_cpu_dec(rqspinlock_held_locks.cnt);
}

#ifdef CONFIG_QUEUED_SPINLOCKS
#define raw_res_spin_lock_init(lock) ({ *(lock) = (rqspinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; })
#else
#define raw_res_spin_lock_init(lock) ({ *(lock) = (rqspinlock_t){0}; })
#endif

#define raw_res_spin_lock(lock) \
	({ \
		int __ret; \
		preempt_disable(); \
		__ret = res_spin_lock(lock); \
		if (__ret) \
			preempt_enable(); \
		__ret; \
	})

#define raw_res_spin_unlock(lock) ({ res_spin_unlock(lock); preempt_enable(); })

#define raw_res_spin_lock_irqsave(lock, flags) \
	({ \
		int __ret; \
		local_irq_save(flags); \
		__ret = raw_res_spin_lock(lock); \
		if (__ret) \
			local_irq_restore(flags); \
		__ret; \
	})

#define raw_res_spin_unlock_irqrestore(lock, flags) ({ raw_res_spin_unlock(lock); local_irq_restore(flags); })

#endif /* __ASM_GENERIC_RQSPINLOCK_H */
@@ -30,6 +30,7 @@
 #include <linux/static_call.h>
 #include <linux/memcontrol.h>
 #include <linux/cfi.h>
+#include <asm/rqspinlock.h>

 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -204,6 +205,7 @@ enum btf_field_type {
 	BPF_REFCOUNT = (1 << 9),
 	BPF_WORKQUEUE = (1 << 10),
 	BPF_UPTR = (1 << 11),
+	BPF_RES_SPIN_LOCK = (1 << 12),
 };

 typedef void (*btf_dtor_kfunc_t)(void *);
@@ -239,6 +241,7 @@ struct btf_record {
 	u32 cnt;
 	u32 field_mask;
 	int spin_lock_off;
+	int res_spin_lock_off;
 	int timer_off;
 	int wq_off;
 	int refcount_off;
@@ -314,6 +317,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
 	switch (type) {
 	case BPF_SPIN_LOCK:
 		return "bpf_spin_lock";
+	case BPF_RES_SPIN_LOCK:
+		return "bpf_res_spin_lock";
 	case BPF_TIMER:
 		return "bpf_timer";
 	case BPF_WORKQUEUE:
@@ -346,6 +351,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
 	switch (type) {
 	case BPF_SPIN_LOCK:
 		return sizeof(struct bpf_spin_lock);
+	case BPF_RES_SPIN_LOCK:
+		return sizeof(struct bpf_res_spin_lock);
 	case BPF_TIMER:
 		return sizeof(struct bpf_timer);
 	case BPF_WORKQUEUE:
@@ -376,6 +383,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 	switch (type) {
 	case BPF_SPIN_LOCK:
 		return __alignof__(struct bpf_spin_lock);
+	case BPF_RES_SPIN_LOCK:
+		return __alignof__(struct bpf_res_spin_lock);
 	case BPF_TIMER:
 		return __alignof__(struct bpf_timer);
 	case BPF_WORKQUEUE:
@@ -419,6 +428,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
 	case BPF_RB_ROOT:
 		/* RB_ROOT_CACHED 0-inits, no need to do anything after memset */
 	case BPF_SPIN_LOCK:
+	case BPF_RES_SPIN_LOCK:
 	case BPF_TIMER:
 	case BPF_WORKQUEUE:
 	case BPF_KPTR_UNREF:
@@ -115,6 +115,14 @@ struct bpf_reg_state {
 			int depth:30;
 		} iter;

+		/* For irq stack slots */
+		struct {
+			enum {
+				IRQ_NATIVE_KFUNC,
+				IRQ_LOCK_KFUNC,
+			} kfunc_class;
+		} irq;
+
 		/* Max size from any of the above. */
 		struct {
 			unsigned long raw1;
@@ -255,9 +263,12 @@ struct bpf_reference_state {
 	 * default to pointer reference on zero initialization of a state.
 	 */
 	enum ref_state_type {
-		REF_TYPE_PTR = 1,
-		REF_TYPE_IRQ = 2,
-		REF_TYPE_LOCK = 3,
+		REF_TYPE_PTR = (1 << 1),
+		REF_TYPE_IRQ = (1 << 2),
+		REF_TYPE_LOCK = (1 << 3),
+		REF_TYPE_RES_LOCK = (1 << 4),
+		REF_TYPE_RES_LOCK_IRQ = (1 << 5),
+		REF_TYPE_LOCK_MASK = REF_TYPE_LOCK | REF_TYPE_RES_LOCK | REF_TYPE_RES_LOCK_IRQ,
 	} type;
 	/* Track each reference created with a unique id, even if the same
 	 * instruction creates the reference multiple times (eg, via CALL).
@@ -424,6 +435,8 @@ struct bpf_verifier_state {
 	u32 active_locks;
 	u32 active_preempt_locks;
 	u32 active_irq_id;
+	u32 active_lock_id;
+	void *active_lock_ptr;
 	bool active_rcu_lock;

 	bool speculative;
@@ -14,7 +14,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
-obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
+obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o rqspinlock.o
 ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
 obj-$(CONFIG_BPF_SYSCALL) += arena.o range_tree.o
 endif
@@ -3481,6 +3481,15 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_
 			goto end;
 		}
 	}
+	if (field_mask & BPF_RES_SPIN_LOCK) {
+		if (!strcmp(name, "bpf_res_spin_lock")) {
+			if (*seen_mask & BPF_RES_SPIN_LOCK)
+				return -E2BIG;
+			*seen_mask |= BPF_RES_SPIN_LOCK;
+			type = BPF_RES_SPIN_LOCK;
+			goto end;
+		}
+	}
 	if (field_mask & BPF_TIMER) {
 		if (!strcmp(name, "bpf_timer")) {
 			if (*seen_mask & BPF_TIMER)
@@ -3659,6 +3668,7 @@ static int btf_find_field_one(const struct btf *btf,

 	switch (field_type) {
 	case BPF_SPIN_LOCK:
+	case BPF_RES_SPIN_LOCK:
 	case BPF_TIMER:
 	case BPF_WORKQUEUE:
 	case BPF_LIST_NODE:
@@ -3952,6 +3962,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 		return ERR_PTR(-ENOMEM);

 	rec->spin_lock_off = -EINVAL;
+	rec->res_spin_lock_off = -EINVAL;
 	rec->timer_off = -EINVAL;
 	rec->wq_off = -EINVAL;
 	rec->refcount_off = -EINVAL;
@@ -3979,6 +3990,11 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 			/* Cache offset for faster lookup at runtime */
 			rec->spin_lock_off = rec->fields[i].offset;
 			break;
+		case BPF_RES_SPIN_LOCK:
+			WARN_ON_ONCE(rec->spin_lock_off >= 0);
+			/* Cache offset for faster lookup at runtime */
+			rec->res_spin_lock_off = rec->fields[i].offset;
+			break;
 		case BPF_TIMER:
 			WARN_ON_ONCE(rec->timer_off >= 0);
 			/* Cache offset for faster lookup at runtime */
@@ -4022,9 +4038,15 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 		rec->cnt++;
 	}

+	if (rec->spin_lock_off >= 0 && rec->res_spin_lock_off >= 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
 	/* bpf_{list_head, rb_node} require bpf_spin_lock */
 	if ((btf_record_has_field(rec, BPF_LIST_HEAD) ||
-	     btf_record_has_field(rec, BPF_RB_ROOT)) && rec->spin_lock_off < 0) {
+	     btf_record_has_field(rec, BPF_RB_ROOT)) &&
+	    (rec->spin_lock_off < 0 && rec->res_spin_lock_off < 0)) {
 		ret = -EINVAL;
 		goto end;
 	}
@@ -5637,7 +5659,7 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)

 		type = &tab->types[tab->cnt];
 		type->btf_id = i;
-		record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
+		record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
 						  BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT |
 						  BPF_KPTR, t->size);
 		/* The record cannot be unset, treat it as an error if so */
@@ -16,6 +16,7 @@
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
 #include <linux/bpf_mem_alloc.h>
+#include <asm/rqspinlock.h>
 
 #define HTAB_CREATE_FLAG_MASK \
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
@@ -78,7 +79,7 @@
  */
 struct bucket {
 	struct hlist_nulls_head head;
-	raw_spinlock_t raw_lock;
+	rqspinlock_t raw_lock;
 };
 
 #define HASHTAB_MAP_LOCK_COUNT 8
@@ -104,8 +105,6 @@ struct bpf_htab {
 	u32 n_buckets;	/* number of hash buckets */
 	u32 elem_size;	/* size of each element in bytes */
 	u32 hashrnd;
-	struct lock_class_key lockdep_key;
-	int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
 };
 
 /* each htab element is struct htab_elem + key + value */
@@ -140,45 +139,26 @@ static void htab_init_buckets(struct bpf_htab *htab)
 
 	for (i = 0; i < htab->n_buckets; i++) {
 		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
-		raw_spin_lock_init(&htab->buckets[i].raw_lock);
-		lockdep_set_class(&htab->buckets[i].raw_lock,
-				  &htab->lockdep_key);
+		raw_res_spin_lock_init(&htab->buckets[i].raw_lock);
 		cond_resched();
 	}
 }
 
-static inline int htab_lock_bucket(const struct bpf_htab *htab,
-				   struct bucket *b, u32 hash,
-				   unsigned long *pflags)
+static inline int htab_lock_bucket(struct bucket *b, unsigned long *pflags)
 {
 	unsigned long flags;
+	int ret;
 
-	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
-
-	preempt_disable();
-	local_irq_save(flags);
-	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
-		__this_cpu_dec(*(htab->map_locked[hash]));
-		local_irq_restore(flags);
-		preempt_enable();
-		return -EBUSY;
-	}
-
-	raw_spin_lock(&b->raw_lock);
+	ret = raw_res_spin_lock_irqsave(&b->raw_lock, flags);
+	if (ret)
+		return ret;
 	*pflags = flags;
-
 	return 0;
 }
 
-static inline void htab_unlock_bucket(const struct bpf_htab *htab,
-				      struct bucket *b, u32 hash,
-				      unsigned long flags)
+static inline void htab_unlock_bucket(struct bucket *b, unsigned long flags)
{
-	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
-	raw_spin_unlock(&b->raw_lock);
-	__this_cpu_dec(*(htab->map_locked[hash]));
-	local_irq_restore(flags);
-	preempt_enable();
+	raw_res_spin_unlock_irqrestore(&b->raw_lock, flags);
 }
 
 static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
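
The conversion above shows the general calling convention for rqspinlock: every acquisition can fail, so callers check the return value and propagate it instead of assuming the lock was taken. A minimal sketch of the pattern, using the lock API added by this series (my_obj and my_update are made-up names, not part of the patch):

struct my_obj {
	rqspinlock_t lock;		/* replaces raw_spinlock_t */
	struct list_head items;
};

static int my_update(struct my_obj *obj)
{
	unsigned long flags;
	int ret;

	ret = raw_res_spin_lock_irqsave(&obj->lock, flags);
	if (ret)	/* -EDEADLK or -ETIMEDOUT: do not touch the protected data */
		return ret;
	/* ... critical section ... */
	raw_res_spin_unlock_irqrestore(&obj->lock, flags);
	return 0;
}
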
@ -483,14 +463,12 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
|
|||
bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
|
||||
bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
|
||||
struct bpf_htab *htab;
|
||||
int err, i;
|
||||
int err;
|
||||
|
||||
htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE);
|
||||
if (!htab)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
|
||||
lockdep_register_key(&htab->lockdep_key);
|
||||
|
||||
bpf_map_init_from_attr(&htab->map, attr);
|
||||
|
||||
if (percpu_lru) {
|
||||
|
@ -536,15 +514,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
|
|||
if (!htab->buckets)
|
||||
goto free_elem_count;
|
||||
|
||||
for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++) {
|
||||
htab->map_locked[i] = bpf_map_alloc_percpu(&htab->map,
|
||||
sizeof(int),
|
||||
sizeof(int),
|
||||
GFP_USER);
|
||||
if (!htab->map_locked[i])
|
||||
goto free_map_locked;
|
||||
}
|
||||
|
||||
if (htab->map.map_flags & BPF_F_ZERO_SEED)
|
||||
htab->hashrnd = 0;
|
||||
else
|
||||
|
@ -607,15 +576,12 @@ free_prealloc:
|
|||
free_map_locked:
|
||||
if (htab->use_percpu_counter)
|
||||
percpu_counter_destroy(&htab->pcount);
|
||||
for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
|
||||
free_percpu(htab->map_locked[i]);
|
||||
bpf_map_area_free(htab->buckets);
|
||||
bpf_mem_alloc_destroy(&htab->pcpu_ma);
|
||||
bpf_mem_alloc_destroy(&htab->ma);
|
||||
free_elem_count:
|
||||
bpf_map_free_elem_count(&htab->map);
|
||||
free_htab:
|
||||
lockdep_unregister_key(&htab->lockdep_key);
|
||||
bpf_map_area_free(htab);
|
||||
return ERR_PTR(err);
|
||||
}
|
||||
|
@ -820,7 +786,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
|
|||
b = __select_bucket(htab, tgt_l->hash);
|
||||
head = &b->head;
|
||||
|
||||
ret = htab_lock_bucket(htab, b, tgt_l->hash, &flags);
|
||||
ret = htab_lock_bucket(b, &flags);
|
||||
if (ret)
|
||||
return false;
|
||||
|
||||
|
@ -831,7 +797,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
|
|||
break;
|
||||
}
|
||||
|
||||
htab_unlock_bucket(htab, b, tgt_l->hash, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
|
||||
if (l == tgt_l)
|
||||
check_and_free_fields(htab, l);
|
||||
|
@ -1150,7 +1116,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
|
|||
*/
|
||||
}
|
||||
|
||||
ret = htab_lock_bucket(htab, b, hash, &flags);
|
||||
ret = htab_lock_bucket(b, &flags);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
|
@ -1201,7 +1167,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
|
|||
check_and_free_fields(htab, l_old);
|
||||
}
|
||||
}
|
||||
htab_unlock_bucket(htab, b, hash, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
if (l_old) {
|
||||
if (old_map_ptr)
|
||||
map->ops->map_fd_put_ptr(map, old_map_ptr, true);
|
||||
|
@ -1210,7 +1176,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
|
|||
}
|
||||
return 0;
|
||||
err:
|
||||
htab_unlock_bucket(htab, b, hash, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
return ret;
|
||||
}
|
||||
|
||||
|
@ -1257,7 +1223,7 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
|
|||
copy_map_value(&htab->map,
|
||||
l_new->key + round_up(map->key_size, 8), value);
|
||||
|
||||
ret = htab_lock_bucket(htab, b, hash, &flags);
|
||||
ret = htab_lock_bucket(b, &flags);
|
||||
if (ret)
|
||||
goto err_lock_bucket;
|
||||
|
||||
|
@ -1278,7 +1244,7 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
|
|||
ret = 0;
|
||||
|
||||
err:
|
||||
htab_unlock_bucket(htab, b, hash, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
|
||||
err_lock_bucket:
|
||||
if (ret)
|
||||
|
@ -1315,7 +1281,7 @@ static long __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
|
|||
b = __select_bucket(htab, hash);
|
||||
head = &b->head;
|
||||
|
||||
ret = htab_lock_bucket(htab, b, hash, &flags);
|
||||
ret = htab_lock_bucket(b, &flags);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
|
@ -1340,7 +1306,7 @@ static long __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
|
|||
}
|
||||
ret = 0;
|
||||
err:
|
||||
htab_unlock_bucket(htab, b, hash, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
return ret;
|
||||
}
|
||||
|
||||
|
@ -1381,7 +1347,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
|
|||
return -ENOMEM;
|
||||
}
|
||||
|
||||
ret = htab_lock_bucket(htab, b, hash, &flags);
|
||||
ret = htab_lock_bucket(b, &flags);
|
||||
if (ret)
|
||||
goto err_lock_bucket;
|
||||
|
||||
|
@ -1405,7 +1371,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
|
|||
}
|
||||
ret = 0;
|
||||
err:
|
||||
htab_unlock_bucket(htab, b, hash, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
err_lock_bucket:
|
||||
if (l_new) {
|
||||
bpf_map_dec_elem_count(&htab->map);
|
||||
|
@ -1447,7 +1413,7 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
|
|||
b = __select_bucket(htab, hash);
|
||||
head = &b->head;
|
||||
|
||||
ret = htab_lock_bucket(htab, b, hash, &flags);
|
||||
ret = htab_lock_bucket(b, &flags);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
|
@ -1457,7 +1423,7 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
|
|||
else
|
||||
ret = -ENOENT;
|
||||
|
||||
htab_unlock_bucket(htab, b, hash, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
|
||||
if (l)
|
||||
free_htab_elem(htab, l);
|
||||
|
@ -1483,7 +1449,7 @@ static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
|
|||
b = __select_bucket(htab, hash);
|
||||
head = &b->head;
|
||||
|
||||
ret = htab_lock_bucket(htab, b, hash, &flags);
|
||||
ret = htab_lock_bucket(b, &flags);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
|
@ -1494,7 +1460,7 @@ static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
|
|||
else
|
||||
ret = -ENOENT;
|
||||
|
||||
htab_unlock_bucket(htab, b, hash, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
if (l)
|
||||
htab_lru_push_free(htab, l);
|
||||
return ret;
|
||||
|
@ -1561,7 +1527,6 @@ static void htab_map_free_timers_and_wq(struct bpf_map *map)
|
|||
static void htab_map_free(struct bpf_map *map)
|
||||
{
|
||||
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
||||
int i;
|
||||
|
||||
/* bpf_free_used_maps() or close(map_fd) will trigger this map_free callback.
|
||||
* bpf_free_used_maps() is called after bpf prog is no longer executing.
|
||||
|
@ -1586,9 +1551,6 @@ static void htab_map_free(struct bpf_map *map)
|
|||
bpf_mem_alloc_destroy(&htab->ma);
|
||||
if (htab->use_percpu_counter)
|
||||
percpu_counter_destroy(&htab->pcount);
|
||||
for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
|
||||
free_percpu(htab->map_locked[i]);
|
||||
lockdep_unregister_key(&htab->lockdep_key);
|
||||
bpf_map_area_free(htab);
|
||||
}
|
||||
|
||||
|
@ -1631,7 +1593,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
|
|||
b = __select_bucket(htab, hash);
|
||||
head = &b->head;
|
||||
|
||||
ret = htab_lock_bucket(htab, b, hash, &bflags);
|
||||
ret = htab_lock_bucket(b, &bflags);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
|
@ -1668,7 +1630,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
|
|||
hlist_nulls_del_rcu(&l->hash_node);
|
||||
|
||||
out_unlock:
|
||||
htab_unlock_bucket(htab, b, hash, bflags);
|
||||
htab_unlock_bucket(b, bflags);
|
||||
|
||||
if (l) {
|
||||
if (is_lru_map)
|
||||
|
@ -1790,7 +1752,7 @@ again_nocopy:
|
|||
head = &b->head;
|
||||
/* do not grab the lock unless need it (bucket_cnt > 0). */
|
||||
if (locked) {
|
||||
ret = htab_lock_bucket(htab, b, batch, &flags);
|
||||
ret = htab_lock_bucket(b, &flags);
|
||||
if (ret) {
|
||||
rcu_read_unlock();
|
||||
bpf_enable_instrumentation();
|
||||
|
@ -1813,7 +1775,7 @@ again_nocopy:
|
|||
/* Note that since bucket_cnt > 0 here, it is implicit
|
||||
* that the locked was grabbed, so release it.
|
||||
*/
|
||||
htab_unlock_bucket(htab, b, batch, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
rcu_read_unlock();
|
||||
bpf_enable_instrumentation();
|
||||
goto after_loop;
|
||||
|
@ -1824,7 +1786,7 @@ again_nocopy:
|
|||
/* Note that since bucket_cnt > 0 here, it is implicit
|
||||
* that the locked was grabbed, so release it.
|
||||
*/
|
||||
htab_unlock_bucket(htab, b, batch, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
rcu_read_unlock();
|
||||
bpf_enable_instrumentation();
|
||||
kvfree(keys);
|
||||
|
@ -1887,7 +1849,7 @@ again_nocopy:
|
|||
dst_val += value_size;
|
||||
}
|
||||
|
||||
htab_unlock_bucket(htab, b, batch, flags);
|
||||
htab_unlock_bucket(b, flags);
|
||||
locked = false;
|
||||
|
||||
while (node_to_free) {
|
||||
|
|
|
@@ -15,6 +15,7 @@
 #include <net/ipv6.h>
 #include <uapi/linux/btf.h>
 #include <linux/btf_ids.h>
+#include <asm/rqspinlock.h>
 #include <linux/bpf_mem_alloc.h>
 
 /* Intermediate node */
@@ -36,7 +37,7 @@ struct lpm_trie {
 	size_t				n_entries;
 	size_t				max_prefixlen;
 	size_t				data_size;
-	raw_spinlock_t			lock;
+	rqspinlock_t			lock;
 };
 
 /* This trie implements a longest prefix match algorithm that can be used to
@@ -342,7 +343,9 @@ static long trie_update_elem(struct bpf_map *map,
 	if (!new_node)
 		return -ENOMEM;
 
-	raw_spin_lock_irqsave(&trie->lock, irq_flags);
+	ret = raw_res_spin_lock_irqsave(&trie->lock, irq_flags);
+	if (ret)
+		goto out_free;
 
 	new_node->prefixlen = key->prefixlen;
 	RCU_INIT_POINTER(new_node->child[0], NULL);
@@ -356,8 +359,7 @@ static long trie_update_elem(struct bpf_map *map,
 	 */
 	slot = &trie->root;
 
-	while ((node = rcu_dereference_protected(*slot,
-					lockdep_is_held(&trie->lock)))) {
+	while ((node = rcu_dereference(*slot))) {
 		matchlen = longest_prefix_match(trie, node, key);
 
 		if (node->prefixlen != matchlen ||
@@ -442,8 +444,8 @@ static long trie_update_elem(struct bpf_map *map,
 	rcu_assign_pointer(*slot, im_node);
 
 out:
-	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
-
+	raw_res_spin_unlock_irqrestore(&trie->lock, irq_flags);
+out_free:
 	if (ret)
 		bpf_mem_cache_free(&trie->ma, new_node);
 	bpf_mem_cache_free_rcu(&trie->ma, free_node);
@@ -467,7 +469,9 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 	if (key->prefixlen > trie->max_prefixlen)
 		return -EINVAL;
 
-	raw_spin_lock_irqsave(&trie->lock, irq_flags);
+	ret = raw_res_spin_lock_irqsave(&trie->lock, irq_flags);
+	if (ret)
+		return ret;
 
 	/* Walk the tree looking for an exact key/length match and keeping
 	 * track of the path we traverse. We will need to know the node
@@ -478,8 +482,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 	trim = &trie->root;
 	trim2 = trim;
 	parent = NULL;
-	while ((node = rcu_dereference_protected(
-			 *trim, lockdep_is_held(&trie->lock)))) {
+	while ((node = rcu_dereference(*trim))) {
 		matchlen = longest_prefix_match(trie, node, key);
 
 		if (node->prefixlen != matchlen ||
@@ -543,7 +546,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 	free_node = node;
 
 out:
-	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
+	raw_res_spin_unlock_irqrestore(&trie->lock, irq_flags);
 
 	bpf_mem_cache_free_rcu(&trie->ma, free_parent);
 	bpf_mem_cache_free_rcu(&trie->ma, free_node);
@@ -592,7 +595,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 			  offsetof(struct bpf_lpm_trie_key_u8, data);
 	trie->max_prefixlen = trie->data_size * 8;
 
-	raw_spin_lock_init(&trie->lock);
+	raw_res_spin_lock_init(&trie->lock);
 
 	/* Allocate intermediate and leaf nodes from the same allocator */
 	leaf_size = sizeof(struct lpm_trie_node) + trie->data_size +
@ -14,11 +14,9 @@ int pcpu_freelist_init(struct pcpu_freelist *s)
|
|||
for_each_possible_cpu(cpu) {
|
||||
struct pcpu_freelist_head *head = per_cpu_ptr(s->freelist, cpu);
|
||||
|
||||
raw_spin_lock_init(&head->lock);
|
||||
raw_res_spin_lock_init(&head->lock);
|
||||
head->first = NULL;
|
||||
}
|
||||
raw_spin_lock_init(&s->extralist.lock);
|
||||
s->extralist.first = NULL;
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
@ -34,56 +32,37 @@ static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
|
|||
WRITE_ONCE(head->first, node);
|
||||
}
|
||||
|
||||
static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head,
|
||||
static inline bool ___pcpu_freelist_push(struct pcpu_freelist_head *head,
|
||||
struct pcpu_freelist_node *node)
|
||||
{
|
||||
raw_spin_lock(&head->lock);
|
||||
pcpu_freelist_push_node(head, node);
|
||||
raw_spin_unlock(&head->lock);
|
||||
}
|
||||
|
||||
static inline bool pcpu_freelist_try_push_extra(struct pcpu_freelist *s,
|
||||
struct pcpu_freelist_node *node)
|
||||
{
|
||||
if (!raw_spin_trylock(&s->extralist.lock))
|
||||
if (raw_res_spin_lock(&head->lock))
|
||||
return false;
|
||||
|
||||
pcpu_freelist_push_node(&s->extralist, node);
|
||||
raw_spin_unlock(&s->extralist.lock);
|
||||
pcpu_freelist_push_node(head, node);
|
||||
raw_res_spin_unlock(&head->lock);
|
||||
return true;
|
||||
}
|
||||
|
||||
static inline void ___pcpu_freelist_push_nmi(struct pcpu_freelist *s,
|
||||
struct pcpu_freelist_node *node)
|
||||
{
|
||||
int cpu, orig_cpu;
|
||||
|
||||
orig_cpu = raw_smp_processor_id();
|
||||
while (1) {
|
||||
for_each_cpu_wrap(cpu, cpu_possible_mask, orig_cpu) {
|
||||
struct pcpu_freelist_head *head;
|
||||
|
||||
head = per_cpu_ptr(s->freelist, cpu);
|
||||
if (raw_spin_trylock(&head->lock)) {
|
||||
pcpu_freelist_push_node(head, node);
|
||||
raw_spin_unlock(&head->lock);
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
/* cannot lock any per cpu lock, try extralist */
|
||||
if (pcpu_freelist_try_push_extra(s, node))
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
void __pcpu_freelist_push(struct pcpu_freelist *s,
|
||||
struct pcpu_freelist_node *node)
|
||||
{
|
||||
if (in_nmi())
|
||||
___pcpu_freelist_push_nmi(s, node);
|
||||
else
|
||||
___pcpu_freelist_push(this_cpu_ptr(s->freelist), node);
|
||||
struct pcpu_freelist_head *head;
|
||||
int cpu;
|
||||
|
||||
if (___pcpu_freelist_push(this_cpu_ptr(s->freelist), node))
|
||||
return;
|
||||
|
||||
while (true) {
|
||||
for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) {
|
||||
if (cpu == raw_smp_processor_id())
|
||||
continue;
|
||||
head = per_cpu_ptr(s->freelist, cpu);
|
||||
if (raw_res_spin_lock(&head->lock))
|
||||
continue;
|
||||
pcpu_freelist_push_node(head, node);
|
||||
raw_res_spin_unlock(&head->lock);
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void pcpu_freelist_push(struct pcpu_freelist *s,
|
||||
|
@ -120,71 +99,29 @@ void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size,
|
|||
|
||||
static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s)
|
||||
{
|
||||
struct pcpu_freelist_node *node = NULL;
|
||||
struct pcpu_freelist_head *head;
|
||||
struct pcpu_freelist_node *node;
|
||||
int cpu;
|
||||
|
||||
for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) {
|
||||
head = per_cpu_ptr(s->freelist, cpu);
|
||||
if (!READ_ONCE(head->first))
|
||||
continue;
|
||||
raw_spin_lock(&head->lock);
|
||||
if (raw_res_spin_lock(&head->lock))
|
||||
continue;
|
||||
node = head->first;
|
||||
if (node) {
|
||||
WRITE_ONCE(head->first, node->next);
|
||||
raw_spin_unlock(&head->lock);
|
||||
raw_res_spin_unlock(&head->lock);
|
||||
return node;
|
||||
}
|
||||
raw_spin_unlock(&head->lock);
|
||||
raw_res_spin_unlock(&head->lock);
|
||||
}
|
||||
|
||||
/* per cpu lists are all empty, try extralist */
|
||||
if (!READ_ONCE(s->extralist.first))
|
||||
return NULL;
|
||||
raw_spin_lock(&s->extralist.lock);
|
||||
node = s->extralist.first;
|
||||
if (node)
|
||||
WRITE_ONCE(s->extralist.first, node->next);
|
||||
raw_spin_unlock(&s->extralist.lock);
|
||||
return node;
|
||||
}
|
||||
|
||||
static struct pcpu_freelist_node *
|
||||
___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
|
||||
{
|
||||
struct pcpu_freelist_head *head;
|
||||
struct pcpu_freelist_node *node;
|
||||
int cpu;
|
||||
|
||||
for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) {
|
||||
head = per_cpu_ptr(s->freelist, cpu);
|
||||
if (!READ_ONCE(head->first))
|
||||
continue;
|
||||
if (raw_spin_trylock(&head->lock)) {
|
||||
node = head->first;
|
||||
if (node) {
|
||||
WRITE_ONCE(head->first, node->next);
|
||||
raw_spin_unlock(&head->lock);
|
||||
return node;
|
||||
}
|
||||
raw_spin_unlock(&head->lock);
|
||||
}
|
||||
}
|
||||
|
||||
/* cannot pop from per cpu lists, try extralist */
|
||||
if (!READ_ONCE(s->extralist.first) || !raw_spin_trylock(&s->extralist.lock))
|
||||
return NULL;
|
||||
node = s->extralist.first;
|
||||
if (node)
|
||||
WRITE_ONCE(s->extralist.first, node->next);
|
||||
raw_spin_unlock(&s->extralist.lock);
|
||||
return node;
|
||||
}
|
||||
|
||||
struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s)
|
||||
{
|
||||
if (in_nmi())
|
||||
return ___pcpu_freelist_pop_nmi(s);
|
||||
return ___pcpu_freelist_pop(s);
|
||||
}
|
||||
|
||||
@@ -5,15 +5,15 @@
 #define __PERCPU_FREELIST_H__
 #include <linux/spinlock.h>
 #include <linux/percpu.h>
+#include <asm/rqspinlock.h>
 
 struct pcpu_freelist_head {
 	struct pcpu_freelist_node *first;
-	raw_spinlock_t lock;
+	rqspinlock_t lock;
 };
 
 struct pcpu_freelist {
 	struct pcpu_freelist_head __percpu *freelist;
-	struct pcpu_freelist_head extralist;
 };
 
 struct pcpu_freelist_node {

kernel/bpf/rqspinlock.c (new file, 737 lines)
@@ -0,0 +1,737 @@
|
|||
// SPDX-License-Identifier: GPL-2.0-or-later
|
||||
/*
|
||||
* Resilient Queued Spin Lock
|
||||
*
|
||||
* (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
|
||||
* (C) Copyright 2013-2014,2018 Red Hat, Inc.
|
||||
* (C) Copyright 2015 Intel Corp.
|
||||
* (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
|
||||
* (C) Copyright 2024-2025 Meta Platforms, Inc. and affiliates.
|
||||
*
|
||||
* Authors: Waiman Long <longman@redhat.com>
|
||||
* Peter Zijlstra <peterz@infradead.org>
|
||||
* Kumar Kartikeya Dwivedi <memxor@gmail.com>
|
||||
*/
|
||||
|
||||
#include <linux/smp.h>
|
||||
#include <linux/bug.h>
|
||||
#include <linux/bpf.h>
|
||||
#include <linux/err.h>
|
||||
#include <linux/cpumask.h>
|
||||
#include <linux/percpu.h>
|
||||
#include <linux/hardirq.h>
|
||||
#include <linux/mutex.h>
|
||||
#include <linux/prefetch.h>
|
||||
#include <asm/byteorder.h>
|
||||
#ifdef CONFIG_QUEUED_SPINLOCKS
|
||||
#include <asm/qspinlock.h>
|
||||
#endif
|
||||
#include <trace/events/lock.h>
|
||||
#include <asm/rqspinlock.h>
|
||||
#include <linux/timekeeping.h>
|
||||
|
||||
/*
|
||||
* Include queued spinlock definitions and statistics code
|
||||
*/
|
||||
#ifdef CONFIG_QUEUED_SPINLOCKS
|
||||
#include "../locking/qspinlock.h"
|
||||
#include "../locking/lock_events.h"
|
||||
#include "rqspinlock.h"
|
||||
#include "../locking/mcs_spinlock.h"
|
||||
#endif
|
||||
|
||||
/*
|
||||
* The basic principle of a queue-based spinlock can best be understood
|
||||
* by studying a classic queue-based spinlock implementation called the
|
||||
* MCS lock. A copy of the original MCS lock paper ("Algorithms for Scalable
|
||||
* Synchronization on Shared-Memory Multiprocessors by Mellor-Crummey and
|
||||
* Scott") is available at
|
||||
*
|
||||
* https://bugzilla.kernel.org/show_bug.cgi?id=206115
|
||||
*
|
||||
* This queued spinlock implementation is based on the MCS lock, however to
|
||||
* make it fit the 4 bytes we assume spinlock_t to be, and preserve its
|
||||
* existing API, we must modify it somehow.
|
||||
*
|
||||
* In particular; where the traditional MCS lock consists of a tail pointer
|
||||
* (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
|
||||
* unlock the next pending (next->locked), we compress both these: {tail,
|
||||
* next->locked} into a single u32 value.
|
||||
*
|
||||
* Since a spinlock disables recursion of its own context and there is a limit
|
||||
* to the contexts that can nest; namely: task, softirq, hardirq, nmi. As there
|
||||
* are at most 4 nesting levels, it can be encoded by a 2-bit number. Now
|
||||
* we can encode the tail by combining the 2-bit nesting level with the cpu
|
||||
* number. With one byte for the lock value and 3 bytes for the tail, only a
|
||||
* 32-bit word is now needed. Even though we only need 1 bit for the lock,
|
||||
* we extend it to a full byte to achieve better performance for architectures
|
||||
* that support atomic byte write.
|
||||
*
|
||||
* We also change the first spinner to spin on the lock bit instead of its
|
||||
* node; whereby avoiding the need to carry a node from lock to unlock, and
|
||||
* preserving existing lock API. This also makes the unlock code simpler and
|
||||
* faster.
|
||||
*
|
||||
* N.B. The current implementation only supports architectures that allow
|
||||
* atomic operations on smaller 8-bit and 16-bit data types.
|
||||
*
|
||||
*/
|
||||
|
||||
struct rqspinlock_timeout {
|
||||
u64 timeout_end;
|
||||
u64 duration;
|
||||
u64 cur;
|
||||
u16 spin;
|
||||
};
|
||||
|
||||
#define RES_TIMEOUT_VAL 2
|
||||
|
||||
DEFINE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
|
||||
EXPORT_SYMBOL_GPL(rqspinlock_held_locks);
|
||||
|
||||
static bool is_lock_released(rqspinlock_t *lock, u32 mask, struct rqspinlock_timeout *ts)
|
||||
{
|
||||
if (!(atomic_read_acquire(&lock->val) & (mask)))
|
||||
return true;
|
||||
return false;
|
||||
}
|
||||
|
||||
static noinline int check_deadlock_AA(rqspinlock_t *lock, u32 mask,
|
||||
struct rqspinlock_timeout *ts)
|
||||
{
|
||||
struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
|
||||
int cnt = min(RES_NR_HELD, rqh->cnt);
|
||||
|
||||
/*
|
||||
* Return an error if we hold the lock we are attempting to acquire.
|
||||
* We'll iterate over max 32 locks; no need to do is_lock_released.
|
||||
*/
|
||||
for (int i = 0; i < cnt - 1; i++) {
|
||||
if (rqh->locks[i] == lock)
|
||||
return -EDEADLK;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* This focuses on the most common case of ABBA deadlocks (or ABBA involving
|
||||
* more locks, which reduce to ABBA). This is not exhaustive, and we rely on
|
||||
* timeouts as the final line of defense.
|
||||
*/
|
||||
static noinline int check_deadlock_ABBA(rqspinlock_t *lock, u32 mask,
|
||||
struct rqspinlock_timeout *ts)
|
||||
{
|
||||
struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
|
||||
int rqh_cnt = min(RES_NR_HELD, rqh->cnt);
|
||||
void *remote_lock;
|
||||
int cpu;
|
||||
|
||||
/*
|
||||
* Find the CPU holding the lock that we want to acquire. If there is a
|
||||
* deadlock scenario, we will read a stable set on the remote CPU and
|
||||
* find the target. This would be a constant time operation instead of
|
||||
* O(NR_CPUS) if we could determine the owning CPU from a lock value, but
|
||||
* that requires increasing the size of the lock word.
|
||||
*/
|
||||
for_each_possible_cpu(cpu) {
|
||||
struct rqspinlock_held *rqh_cpu = per_cpu_ptr(&rqspinlock_held_locks, cpu);
|
||||
int real_cnt = READ_ONCE(rqh_cpu->cnt);
|
||||
int cnt = min(RES_NR_HELD, real_cnt);
|
||||
|
||||
/*
|
||||
* Let's ensure to break out of this loop if the lock is available for
|
||||
* us to potentially acquire.
|
||||
*/
|
||||
if (is_lock_released(lock, mask, ts))
|
||||
return 0;
|
||||
|
||||
/*
|
||||
* Skip ourselves, and CPUs whose count is less than 2, as they need at
|
||||
* least one held lock and one acquisition attempt (reflected as top
|
||||
* most entry) to participate in an ABBA deadlock.
|
||||
*
|
||||
* If cnt is more than RES_NR_HELD, it means the current lock being
|
||||
* acquired won't appear in the table, and other locks in the table are
|
||||
* already held, so we can't determine ABBA.
|
||||
*/
|
||||
if (cpu == smp_processor_id() || real_cnt < 2 || real_cnt > RES_NR_HELD)
|
||||
continue;
|
||||
|
||||
/*
|
||||
* Obtain the entry at the top, this corresponds to the lock the
|
||||
* remote CPU is attempting to acquire in a deadlock situation,
|
||||
* and would be one of the locks we hold on the current CPU.
|
||||
*/
|
||||
remote_lock = READ_ONCE(rqh_cpu->locks[cnt - 1]);
|
||||
/*
|
||||
* If it is NULL, we've raced and cannot determine a deadlock
|
||||
* conclusively, skip this CPU.
|
||||
*/
|
||||
if (!remote_lock)
|
||||
continue;
|
||||
/*
|
||||
* Find if the lock we're attempting to acquire is held by this CPU.
|
||||
* Don't consider the topmost entry, as that must be the latest lock
|
||||
* being held or acquired. For a deadlock, the target CPU must also
|
||||
* attempt to acquire a lock we hold, so for this search only 'cnt - 1'
|
||||
* entries are important.
|
||||
*/
|
||||
for (int i = 0; i < cnt - 1; i++) {
|
||||
if (READ_ONCE(rqh_cpu->locks[i]) != lock)
|
||||
continue;
|
||||
/*
|
||||
* We found our lock as held on the remote CPU. Is the
|
||||
* acquisition attempt on the remote CPU for a lock held
|
||||
* by us? If so, we have a deadlock situation, and need
|
||||
* to recover.
|
||||
*/
|
||||
for (int i = 0; i < rqh_cnt - 1; i++) {
|
||||
if (rqh->locks[i] == remote_lock)
|
||||
return -EDEADLK;
|
||||
}
|
||||
/*
|
||||
* Inconclusive; retry again later.
|
||||
*/
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
return 0;
|
||||
}
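
/*
 * Illustrative example (not part of the patch): the ABBA shape the check
 * above is meant to catch, assuming two rqspinlocks A and B:
 *
 *   CPU 0                       CPU 1
 *   res_spin_lock(&A)           res_spin_lock(&B)
 *   res_spin_lock(&B) spins     res_spin_lock(&A) spins
 *
 * Each CPU publishes its held locks and the lock it is attempting in
 * rqspinlock_held_locks, so one of the two attempts observes the cycle
 * and returns -EDEADLK instead of spinning forever.
 */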
|
||||
|
||||
static noinline int check_deadlock(rqspinlock_t *lock, u32 mask,
|
||||
struct rqspinlock_timeout *ts)
|
||||
{
|
||||
int ret;
|
||||
|
||||
ret = check_deadlock_AA(lock, mask, ts);
|
||||
if (ret)
|
||||
return ret;
|
||||
ret = check_deadlock_ABBA(lock, mask, ts);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
|
||||
struct rqspinlock_timeout *ts)
|
||||
{
|
||||
u64 time = ktime_get_mono_fast_ns();
|
||||
u64 prev = ts->cur;
|
||||
|
||||
if (!ts->timeout_end) {
|
||||
ts->cur = time;
|
||||
ts->timeout_end = time + ts->duration;
|
||||
return 0;
|
||||
}
|
||||
|
||||
if (time > ts->timeout_end)
|
||||
return -ETIMEDOUT;
|
||||
|
||||
/*
|
||||
* A millisecond interval passed from last time? Trigger deadlock
|
||||
* checks.
|
||||
*/
|
||||
if (prev + NSEC_PER_MSEC < time) {
|
||||
ts->cur = time;
|
||||
return check_deadlock(lock, mask, ts);
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* Do not amortize with spins when res_smp_cond_load_acquire is defined,
|
||||
* as the macro does internal amortization for us.
|
||||
*/
|
||||
#ifndef res_smp_cond_load_acquire
|
||||
#define RES_CHECK_TIMEOUT(ts, ret, mask) \
|
||||
({ \
|
||||
if (!(ts).spin++) \
|
||||
(ret) = check_timeout((lock), (mask), &(ts)); \
|
||||
(ret); \
|
||||
})
|
||||
#else
|
||||
#define RES_CHECK_TIMEOUT(ts, ret, mask) \
|
||||
({ (ret) = check_timeout(&(ts)); })
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Initialize the 'spin' member.
|
||||
* Set spin member to 0 to trigger AA/ABBA checks immediately.
|
||||
*/
|
||||
#define RES_INIT_TIMEOUT(ts) ({ (ts).spin = 0; })
|
||||
|
||||
/*
|
||||
* We only need to reset 'timeout_end', 'spin' will just wrap around as necessary.
|
||||
* Duration is defined for each spin attempt, so set it here.
|
||||
*/
|
||||
#define RES_RESET_TIMEOUT(ts, _duration) ({ (ts).timeout_end = 0; (ts).duration = _duration; })
|
||||
|
||||
/*
|
||||
* Provide a test-and-set fallback for cases when queued spin lock support is
|
||||
* absent from the architecture.
|
||||
*/
|
||||
int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock)
|
||||
{
|
||||
struct rqspinlock_timeout ts;
|
||||
int val, ret = 0;
|
||||
|
||||
RES_INIT_TIMEOUT(ts);
|
||||
grab_held_lock_entry(lock);
|
||||
|
||||
/*
|
||||
* Since the waiting loop's time is dependent on the amount of
|
||||
* contention, a short timeout unlike rqspinlock waiting loops
|
||||
* isn't enough. Choose a second as the timeout value.
|
||||
*/
|
||||
RES_RESET_TIMEOUT(ts, NSEC_PER_SEC);
|
||||
retry:
|
||||
val = atomic_read(&lock->val);
|
||||
|
||||
if (val || !atomic_try_cmpxchg(&lock->val, &val, 1)) {
|
||||
if (RES_CHECK_TIMEOUT(ts, ret, ~0u))
|
||||
goto out;
|
||||
cpu_relax();
|
||||
goto retry;
|
||||
}
|
||||
|
||||
return 0;
|
||||
out:
|
||||
release_held_lock_entry();
|
||||
return ret;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(resilient_tas_spin_lock);
|
||||
|
||||
#ifdef CONFIG_QUEUED_SPINLOCKS
|
||||
|
||||
/*
|
||||
* Per-CPU queue node structures; we can never have more than 4 nested
|
||||
* contexts: task, softirq, hardirq, nmi.
|
||||
*
|
||||
* Exactly fits one 64-byte cacheline on a 64-bit architecture.
|
||||
*/
|
||||
static DEFINE_PER_CPU_ALIGNED(struct qnode, rqnodes[_Q_MAX_NODES]);
|
||||
|
||||
#ifndef res_smp_cond_load_acquire
|
||||
#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire(v, c)
|
||||
#endif
|
||||
|
||||
#define res_atomic_cond_read_acquire(v, c) res_smp_cond_load_acquire(&(v)->counter, (c))
|
||||
|
||||
/**
|
||||
* resilient_queued_spin_lock_slowpath - acquire the queued spinlock
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
* @val: Current value of the queued spinlock 32-bit word
|
||||
*
|
||||
* Return:
|
||||
* * 0 - Lock was acquired successfully.
|
||||
* * -EDEADLK - Lock acquisition failed because of AA/ABBA deadlock.
|
||||
* * -ETIMEDOUT - Lock acquisition failed because of timeout.
|
||||
*
|
||||
* (queue tail, pending bit, lock value)
|
||||
*
|
||||
* fast : slow : unlock
|
||||
* : :
|
||||
* uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
|
||||
* : | ^--------.------. / :
|
||||
* : v \ \ | :
|
||||
* pending : (0,1,1) +--> (0,1,0) \ | :
|
||||
* : | ^--' | | :
|
||||
* : v | | :
|
||||
* uncontended : (n,x,y) +--> (n,0,0) --' | :
|
||||
* queue : | ^--' | :
|
||||
* : v | :
|
||||
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
|
||||
* queue : ^--' :
|
||||
*/
|
||||
int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
|
||||
{
|
||||
struct mcs_spinlock *prev, *next, *node;
|
||||
struct rqspinlock_timeout ts;
|
||||
int idx, ret = 0;
|
||||
u32 old, tail;
|
||||
|
||||
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
|
||||
|
||||
if (resilient_virt_spin_lock_enabled())
|
||||
return resilient_virt_spin_lock(lock);
|
||||
|
||||
RES_INIT_TIMEOUT(ts);
|
||||
|
||||
/*
|
||||
* Wait for in-progress pending->locked hand-overs with a bounded
|
||||
* number of spins so that we guarantee forward progress.
|
||||
*
|
||||
* 0,1,0 -> 0,0,1
|
||||
*/
|
||||
if (val == _Q_PENDING_VAL) {
|
||||
int cnt = _Q_PENDING_LOOPS;
|
||||
val = atomic_cond_read_relaxed(&lock->val,
|
||||
(VAL != _Q_PENDING_VAL) || !cnt--);
|
||||
}
|
||||
|
||||
/*
|
||||
* If we observe any contention; queue.
|
||||
*/
|
||||
if (val & ~_Q_LOCKED_MASK)
|
||||
goto queue;
|
||||
|
||||
/*
|
||||
* trylock || pending
|
||||
*
|
||||
* 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
|
||||
*/
|
||||
val = queued_fetch_set_pending_acquire(lock);
|
||||
|
||||
/*
|
||||
* If we observe contention, there is a concurrent locker.
|
||||
*
|
||||
* Undo and queue; our setting of PENDING might have made the
|
||||
* n,0,0 -> 0,0,0 transition fail and it will now be waiting
|
||||
* on @next to become !NULL.
|
||||
*/
|
||||
if (unlikely(val & ~_Q_LOCKED_MASK)) {
|
||||
|
||||
/* Undo PENDING if we set it. */
|
||||
if (!(val & _Q_PENDING_MASK))
|
||||
clear_pending(lock);
|
||||
|
||||
goto queue;
|
||||
}
|
||||
|
||||
/*
|
||||
* Grab an entry in the held locks array, to enable deadlock detection.
|
||||
*/
|
||||
grab_held_lock_entry(lock);
|
||||
|
||||
/*
|
||||
* We're pending, wait for the owner to go away.
|
||||
*
|
||||
* 0,1,1 -> *,1,0
|
||||
*
|
||||
* this wait loop must be a load-acquire such that we match the
|
||||
* store-release that clears the locked bit and create lock
|
||||
* sequentiality; this is because not all
|
||||
* clear_pending_set_locked() implementations imply full
|
||||
* barriers.
|
||||
*/
|
||||
if (val & _Q_LOCKED_MASK) {
|
||||
RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
|
||||
res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
|
||||
}
|
||||
|
||||
if (ret) {
|
||||
/*
|
||||
* We waited for the locked bit to go back to 0, as the pending
|
||||
* waiter, but timed out. We need to clear the pending bit since
|
||||
* we own it. Once a stuck owner has been recovered, the lock
|
||||
* must be restored to a valid state, hence removing the pending
|
||||
* bit is necessary.
|
||||
*
|
||||
* *,1,* -> *,0,*
|
||||
*/
|
||||
clear_pending(lock);
|
||||
lockevent_inc(rqspinlock_lock_timeout);
|
||||
goto err_release_entry;
|
||||
}
|
||||
|
||||
/*
|
||||
* take ownership and clear the pending bit.
|
||||
*
|
||||
* 0,1,0 -> 0,0,1
|
||||
*/
|
||||
clear_pending_set_locked(lock);
|
||||
lockevent_inc(lock_pending);
|
||||
return 0;
|
||||
|
||||
/*
|
||||
* End of pending bit optimistic spinning and beginning of MCS
|
||||
* queuing.
|
||||
*/
|
||||
queue:
|
||||
lockevent_inc(lock_slowpath);
|
||||
/*
|
||||
* Grab deadlock detection entry for the queue path.
|
||||
*/
|
||||
grab_held_lock_entry(lock);
|
||||
|
||||
node = this_cpu_ptr(&rqnodes[0].mcs);
|
||||
idx = node->count++;
|
||||
tail = encode_tail(smp_processor_id(), idx);
|
||||
|
||||
trace_contention_begin(lock, LCB_F_SPIN);
|
||||
|
||||
/*
|
||||
* 4 nodes are allocated based on the assumption that there will
|
||||
* not be nested NMIs taking spinlocks. That may not be true in
|
||||
* some architectures even though the chance of needing more than
|
||||
* 4 nodes will still be extremely unlikely. When that happens,
|
||||
* we fall back to spinning on the lock directly without using
|
||||
* any MCS node. This is not the most elegant solution, but is
|
||||
* simple enough.
|
||||
*/
|
||||
if (unlikely(idx >= _Q_MAX_NODES)) {
|
||||
lockevent_inc(lock_no_node);
|
||||
RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
|
||||
while (!queued_spin_trylock(lock)) {
|
||||
if (RES_CHECK_TIMEOUT(ts, ret, ~0u)) {
|
||||
lockevent_inc(rqspinlock_lock_timeout);
|
||||
goto err_release_node;
|
||||
}
|
||||
cpu_relax();
|
||||
}
|
||||
goto release;
|
||||
}
|
||||
|
||||
node = grab_mcs_node(node, idx);
|
||||
|
||||
/*
|
||||
* Keep counts of non-zero index values:
|
||||
*/
|
||||
lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
|
||||
|
||||
/*
|
||||
* Ensure that we increment the head node->count before initialising
|
||||
* the actual node. If the compiler is kind enough to reorder these
|
||||
* stores, then an IRQ could overwrite our assignments.
|
||||
*/
|
||||
barrier();
|
||||
|
||||
node->locked = 0;
|
||||
node->next = NULL;
|
||||
|
||||
/*
|
||||
* We touched a (possibly) cold cacheline in the per-cpu queue node;
|
||||
* attempt the trylock once more in the hope someone let go while we
|
||||
* weren't watching.
|
||||
*/
|
||||
if (queued_spin_trylock(lock))
|
||||
goto release;
|
||||
|
||||
/*
|
||||
* Ensure that the initialisation of @node is complete before we
|
||||
* publish the updated tail via xchg_tail() and potentially link
|
||||
* @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
|
||||
*/
|
||||
smp_wmb();
|
||||
|
||||
/*
|
||||
* Publish the updated tail.
|
||||
* We have already touched the queueing cacheline; don't bother with
|
||||
* pending stuff.
|
||||
*
|
||||
* p,*,* -> n,*,*
|
||||
*/
|
||||
old = xchg_tail(lock, tail);
|
||||
next = NULL;
|
||||
|
||||
/*
|
||||
* if there was a previous node; link it and wait until reaching the
|
||||
* head of the waitqueue.
|
||||
*/
|
||||
if (old & _Q_TAIL_MASK) {
|
||||
int val;
|
||||
|
||||
prev = decode_tail(old, rqnodes);
|
||||
|
||||
/* Link @node into the waitqueue. */
|
||||
WRITE_ONCE(prev->next, node);
|
||||
|
||||
val = arch_mcs_spin_lock_contended(&node->locked);
|
||||
if (val == RES_TIMEOUT_VAL) {
|
||||
ret = -EDEADLK;
|
||||
goto waitq_timeout;
|
||||
}
|
||||
|
||||
/*
|
||||
* While waiting for the MCS lock, the next pointer may have
|
||||
* been set by another lock waiter. We optimistically load
|
||||
* the next pointer & prefetch the cacheline for writing
|
||||
* to reduce latency in the upcoming MCS unlock operation.
|
||||
*/
|
||||
next = READ_ONCE(node->next);
|
||||
if (next)
|
||||
prefetchw(next);
|
||||
}
|
||||
|
||||
/*
|
||||
* we're at the head of the waitqueue, wait for the owner & pending to
|
||||
* go away.
|
||||
*
|
||||
* *,x,y -> *,0,0
|
||||
*
|
||||
* this wait loop must use a load-acquire such that we match the
|
||||
* store-release that clears the locked bit and create lock
|
||||
* sequentiality; this is because the set_locked() function below
|
||||
* does not imply a full barrier.
|
||||
*
|
||||
* We use RES_DEF_TIMEOUT * 2 as the duration, as RES_DEF_TIMEOUT is
|
||||
* meant to span maximum allowed time per critical section, and we may
|
||||
* have both the owner of the lock and the pending bit waiter ahead of
|
||||
* us.
|
||||
*/
|
||||
RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT * 2);
|
||||
val = res_atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
|
||||
RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));
|
||||
|
||||
waitq_timeout:
|
||||
if (ret) {
|
||||
/*
|
||||
* If the tail is still pointing to us, then we are the final waiter,
|
||||
* and are responsible for resetting the tail back to 0. Otherwise, if
|
||||
* the cmpxchg operation fails, we signal the next waiter to take exit
|
||||
* and try the same. For a waiter with tail node 'n':
|
||||
*
|
||||
* n,*,* -> 0,*,*
|
||||
*
|
||||
* When performing cmpxchg for the whole word (NR_CPUS > 16k), it is
|
||||
* possible locked/pending bits keep changing and we see failures even
|
||||
* when we remain the head of wait queue. However, eventually,
|
||||
* pending bit owner will unset the pending bit, and new waiters
|
||||
* will queue behind us. This will leave the lock owner in
|
||||
* charge, and it will eventually either set locked bit to 0, or
|
||||
* leave it as 1, allowing us to make progress.
|
||||
*
|
||||
* We terminate the whole wait queue for two reasons. Firstly,
|
||||
* we eschew per-waiter timeouts with one applied at the head of
|
||||
* the wait queue. This allows everyone to break out faster
|
||||
* once we've seen the owner / pending waiter not responding for
|
||||
* the timeout duration from the head. Secondly, it avoids
|
||||
* complicated synchronization, because when not leaving in FIFO
|
||||
* order, prev's next pointer needs to be fixed up etc.
|
||||
*/
|
||||
if (!try_cmpxchg_tail(lock, tail, 0)) {
|
||||
next = smp_cond_load_relaxed(&node->next, VAL);
|
||||
WRITE_ONCE(next->locked, RES_TIMEOUT_VAL);
|
||||
}
|
||||
lockevent_inc(rqspinlock_lock_timeout);
|
||||
goto err_release_node;
|
||||
}
|
||||
|
||||
/*
|
||||
* claim the lock:
|
||||
*
|
||||
* n,0,0 -> 0,0,1 : lock, uncontended
|
||||
* *,*,0 -> *,*,1 : lock, contended
|
||||
*
|
||||
* If the queue head is the only one in the queue (lock value == tail)
|
||||
* and nobody is pending, clear the tail code and grab the lock.
|
||||
* Otherwise, we only need to grab the lock.
|
||||
*/
|
||||
|
||||
/*
|
||||
* Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
|
||||
* above wait condition, therefore any concurrent setting of
|
||||
* PENDING will make the uncontended transition fail.
|
||||
*/
|
||||
if ((val & _Q_TAIL_MASK) == tail) {
|
||||
if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
|
||||
goto release; /* No contention */
|
||||
}
|
||||
|
||||
/*
|
||||
* Either somebody is queued behind us or _Q_PENDING_VAL got set
|
||||
* which will then detect the remaining tail and queue behind us
|
||||
* ensuring we'll see a @next.
|
||||
*/
|
||||
set_locked(lock);
|
||||
|
||||
/*
|
||||
* contended path; wait for next if not observed yet, release.
|
||||
*/
|
||||
if (!next)
|
||||
next = smp_cond_load_relaxed(&node->next, (VAL));
|
||||
|
||||
arch_mcs_spin_unlock_contended(&next->locked);
|
||||
|
||||
release:
|
||||
trace_contention_end(lock, 0);
|
||||
|
||||
/*
|
||||
* release the node
|
||||
*/
|
||||
__this_cpu_dec(rqnodes[0].mcs.count);
|
||||
return ret;
|
||||
err_release_node:
|
||||
trace_contention_end(lock, ret);
|
||||
__this_cpu_dec(rqnodes[0].mcs.count);
|
||||
err_release_entry:
|
||||
release_held_lock_entry();
|
||||
return ret;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(resilient_queued_spin_lock_slowpath);
|
||||
|
||||
#endif /* CONFIG_QUEUED_SPINLOCKS */
|
||||
|
||||
__bpf_kfunc_start_defs();
|
||||
|
||||
__bpf_kfunc int bpf_res_spin_lock(struct bpf_res_spin_lock *lock)
|
||||
{
|
||||
int ret;
|
||||
|
||||
BUILD_BUG_ON(sizeof(rqspinlock_t) != sizeof(struct bpf_res_spin_lock));
|
||||
BUILD_BUG_ON(__alignof__(rqspinlock_t) != __alignof__(struct bpf_res_spin_lock));
|
||||
|
||||
preempt_disable();
|
||||
ret = res_spin_lock((rqspinlock_t *)lock);
|
||||
if (unlikely(ret)) {
|
||||
preempt_enable();
|
||||
return ret;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
__bpf_kfunc void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock)
|
||||
{
|
||||
res_spin_unlock((rqspinlock_t *)lock);
|
||||
preempt_enable();
|
||||
}
|
||||
|
||||
__bpf_kfunc int bpf_res_spin_lock_irqsave(struct bpf_res_spin_lock *lock, unsigned long *flags__irq_flag)
|
||||
{
|
||||
u64 *ptr = (u64 *)flags__irq_flag;
|
||||
unsigned long flags;
|
||||
int ret;
|
||||
|
||||
preempt_disable();
|
||||
local_irq_save(flags);
|
||||
ret = res_spin_lock((rqspinlock_t *)lock);
|
||||
if (unlikely(ret)) {
|
||||
local_irq_restore(flags);
|
||||
preempt_enable();
|
||||
return ret;
|
||||
}
|
||||
*ptr = flags;
|
||||
return 0;
|
||||
}
|
||||
|
||||
__bpf_kfunc void bpf_res_spin_unlock_irqrestore(struct bpf_res_spin_lock *lock, unsigned long *flags__irq_flag)
|
||||
{
|
||||
u64 *ptr = (u64 *)flags__irq_flag;
|
||||
unsigned long flags = *ptr;
|
||||
|
||||
res_spin_unlock((rqspinlock_t *)lock);
|
||||
local_irq_restore(flags);
|
||||
preempt_enable();
|
||||
}
|
||||
|
||||
__bpf_kfunc_end_defs();
|
||||
|
||||
BTF_KFUNCS_START(rqspinlock_kfunc_ids)
|
||||
BTF_ID_FLAGS(func, bpf_res_spin_lock, KF_RET_NULL)
|
||||
BTF_ID_FLAGS(func, bpf_res_spin_unlock)
|
||||
BTF_ID_FLAGS(func, bpf_res_spin_lock_irqsave, KF_RET_NULL)
|
||||
BTF_ID_FLAGS(func, bpf_res_spin_unlock_irqrestore)
|
||||
BTF_KFUNCS_END(rqspinlock_kfunc_ids)
|
||||
|
||||
static const struct btf_kfunc_id_set rqspinlock_kfunc_set = {
|
||||
.owner = THIS_MODULE,
|
||||
.set = &rqspinlock_kfunc_ids,
|
||||
};
|
||||
|
||||
static __init int rqspinlock_register_kfuncs(void)
|
||||
{
|
||||
return register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &rqspinlock_kfunc_set);
|
||||
}
|
||||
late_initcall(rqspinlock_register_kfuncs);
|
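
To show how the kfuncs registered above are meant to be consumed, here is a rough BPF-side sketch. It is illustrative only: the map, value and program names are invented, and the extern kfunc declarations are assumed to be provided by vmlinux.h or an equivalent header; only the kfunc names and their error-returning semantics come from this file.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __ksym;
extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __ksym;

struct elem {
	struct bpf_res_spin_lock lock;
	long counter;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct elem);
} counters SEC(".maps");

SEC("tc")
int count_packets(struct __sk_buff *skb)
{
	int key = 0;
	struct elem *e = bpf_map_lookup_elem(&counters, &key);

	if (!e)
		return 0;
	/* Unlike bpf_spin_lock(), acquisition can fail; the error must be checked. */
	if (bpf_res_spin_lock(&e->lock))
		return 0;
	e->counter++;
	bpf_res_spin_unlock(&e->lock);
	return 0;
}

char _license[] SEC("license") = "GPL";
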
kernel/bpf/rqspinlock.h (new file, 48 lines)
@@ -0,0 +1,48 @@
|
|||
/* SPDX-License-Identifier: GPL-2.0-or-later */
|
||||
/*
|
||||
* Resilient Queued Spin Lock defines
|
||||
*
|
||||
* (C) Copyright 2024-2025 Meta Platforms, Inc. and affiliates.
|
||||
*
|
||||
* Authors: Kumar Kartikeya Dwivedi <memxor@gmail.com>
|
||||
*/
|
||||
#ifndef __LINUX_RQSPINLOCK_H
|
||||
#define __LINUX_RQSPINLOCK_H
|
||||
|
||||
#include "../locking/qspinlock.h"
|
||||
|
||||
/*
|
||||
* try_cmpxchg_tail - Return result of cmpxchg of tail word with a new value
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
* @tail: The tail to compare against
|
||||
* @new_tail: The new queue tail code word
|
||||
* Return: Bool to indicate whether the cmpxchg operation succeeded
|
||||
*
|
||||
* This is used by the head of the wait queue to clean up the queue.
|
||||
* Provides relaxed ordering, since observers only rely on initialized
|
||||
* state of the node which was made visible through the xchg_tail operation,
|
||||
* i.e. through the smp_wmb preceding xchg_tail.
|
||||
*
|
||||
* We avoid using 16-bit cmpxchg, which is not available on all architectures.
|
||||
*/
|
||||
static __always_inline bool try_cmpxchg_tail(struct qspinlock *lock, u32 tail, u32 new_tail)
|
||||
{
|
||||
u32 old, new;
|
||||
|
||||
old = atomic_read(&lock->val);
|
||||
do {
|
||||
/*
|
||||
* Is the tail part we compare to already stale? Fail.
|
||||
*/
|
||||
if ((old & _Q_TAIL_MASK) != tail)
|
||||
return false;
|
||||
/*
|
||||
* Encode latest locked/pending state for new tail.
|
||||
*/
|
||||
new = (old & _Q_LOCKED_PENDING_MASK) | new_tail;
|
||||
} while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
#endif /* __LINUX_RQSPINLOCK_H */

@@ -648,6 +648,7 @@ void btf_record_free(struct btf_record *rec)
 		case BPF_RB_ROOT:
 		case BPF_RB_NODE:
 		case BPF_SPIN_LOCK:
+		case BPF_RES_SPIN_LOCK:
 		case BPF_TIMER:
 		case BPF_REFCOUNT:
 		case BPF_WORKQUEUE:
@@ -700,6 +701,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
 		case BPF_RB_ROOT:
 		case BPF_RB_NODE:
 		case BPF_SPIN_LOCK:
+		case BPF_RES_SPIN_LOCK:
 		case BPF_TIMER:
 		case BPF_REFCOUNT:
 		case BPF_WORKQUEUE:
@@ -777,6 +779,7 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 
 		switch (fields[i].type) {
 		case BPF_SPIN_LOCK:
+		case BPF_RES_SPIN_LOCK:
 			break;
 		case BPF_TIMER:
 			bpf_timer_cancel_and_free(field_ptr);
@@ -1212,7 +1215,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 		return -EINVAL;
 
 	map->record = btf_parse_fields(btf, value_type,
-				       BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
+				       BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
				       BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR,
				       map->value_size);
 	if (!IS_ERR_OR_NULL(map->record)) {
@@ -1231,6 +1234,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 		case 0:
 			continue;
 		case BPF_SPIN_LOCK:
+		case BPF_RES_SPIN_LOCK:
 			if (map->map_type != BPF_MAP_TYPE_HASH &&
 			    map->map_type != BPF_MAP_TYPE_ARRAY &&
 			    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
@ -456,7 +456,7 @@ static bool subprog_is_exc_cb(struct bpf_verifier_env *env, int subprog)
|
|||
|
||||
static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
|
||||
{
|
||||
return btf_record_has_field(reg_btf_record(reg), BPF_SPIN_LOCK);
|
||||
return btf_record_has_field(reg_btf_record(reg), BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK);
|
||||
}
|
||||
|
||||
static bool type_is_rdonly_mem(u32 type)
|
||||
|
@ -1155,7 +1155,8 @@ static int release_irq_state(struct bpf_verifier_state *state, int id);
|
|||
|
||||
static int mark_stack_slot_irq_flag(struct bpf_verifier_env *env,
|
||||
struct bpf_kfunc_call_arg_meta *meta,
|
||||
struct bpf_reg_state *reg, int insn_idx)
|
||||
struct bpf_reg_state *reg, int insn_idx,
|
||||
int kfunc_class)
|
||||
{
|
||||
struct bpf_func_state *state = func(env, reg);
|
||||
struct bpf_stack_state *slot;
|
||||
|
@ -1177,6 +1178,7 @@ static int mark_stack_slot_irq_flag(struct bpf_verifier_env *env,
|
|||
st->type = PTR_TO_STACK; /* we don't have dedicated reg type */
|
||||
st->live |= REG_LIVE_WRITTEN;
|
||||
st->ref_obj_id = id;
|
||||
st->irq.kfunc_class = kfunc_class;
|
||||
|
||||
for (i = 0; i < BPF_REG_SIZE; i++)
|
||||
slot->slot_type[i] = STACK_IRQ_FLAG;
|
||||
|
@ -1185,7 +1187,8 @@ static int mark_stack_slot_irq_flag(struct bpf_verifier_env *env,
|
|||
return 0;
|
||||
}
|
||||
|
||||
static int unmark_stack_slot_irq_flag(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
|
||||
static int unmark_stack_slot_irq_flag(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
|
||||
int kfunc_class)
|
||||
{
|
||||
struct bpf_func_state *state = func(env, reg);
|
||||
struct bpf_stack_state *slot;
|
||||
|
@ -1199,6 +1202,15 @@ static int unmark_stack_slot_irq_flag(struct bpf_verifier_env *env, struct bpf_r
|
|||
slot = &state->stack[spi];
|
||||
st = &slot->spilled_ptr;
|
||||
|
||||
if (st->irq.kfunc_class != kfunc_class) {
|
||||
const char *flag_kfunc = st->irq.kfunc_class == IRQ_NATIVE_KFUNC ? "native" : "lock";
|
||||
const char *used_kfunc = kfunc_class == IRQ_NATIVE_KFUNC ? "native" : "lock";
|
||||
|
||||
verbose(env, "irq flag acquired by %s kfuncs cannot be restored with %s kfuncs\n",
|
||||
flag_kfunc, used_kfunc);
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
err = release_irq_state(env->cur_state, st->ref_obj_id);
|
||||
WARN_ON_ONCE(err && err != -EACCES);
|
||||
if (err) {
|
||||
|
@ -1416,6 +1428,8 @@ static int copy_reference_state(struct bpf_verifier_state *dst, const struct bpf
|
|||
dst->active_preempt_locks = src->active_preempt_locks;
|
||||
dst->active_rcu_lock = src->active_rcu_lock;
|
||||
dst->active_irq_id = src->active_irq_id;
|
||||
dst->active_lock_id = src->active_lock_id;
|
||||
dst->active_lock_ptr = src->active_lock_ptr;
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
@ -1515,6 +1529,8 @@ static int acquire_lock_state(struct bpf_verifier_env *env, int insn_idx, enum r
|
|||
s->ptr = ptr;
|
||||
|
||||
state->active_locks++;
|
||||
state->active_lock_id = id;
|
||||
state->active_lock_ptr = ptr;
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
@ -1565,16 +1581,24 @@ static bool find_reference_state(struct bpf_verifier_state *state, int ptr_id)
|
|||
|
||||
static int release_lock_state(struct bpf_verifier_state *state, int type, int id, void *ptr)
|
||||
{
|
||||
void *prev_ptr = NULL;
|
||||
u32 prev_id = 0;
|
||||
int i;
|
||||
|
||||
for (i = 0; i < state->acquired_refs; i++) {
|
||||
if (state->refs[i].type != type)
|
||||
continue;
|
||||
if (state->refs[i].id == id && state->refs[i].ptr == ptr) {
|
||||
if (state->refs[i].type == type && state->refs[i].id == id &&
|
||||
state->refs[i].ptr == ptr) {
|
||||
release_reference_state(state, i);
|
||||
state->active_locks--;
|
||||
/* Reassign active lock (id, ptr). */
|
||||
state->active_lock_id = prev_id;
|
||||
state->active_lock_ptr = prev_ptr;
|
||||
return 0;
|
||||
}
|
||||
if (state->refs[i].type & REF_TYPE_LOCK_MASK) {
|
||||
prev_id = state->refs[i].id;
|
||||
prev_ptr = state->refs[i].ptr;
|
||||
}
|
||||
}
|
||||
return -EINVAL;
|
||||
}
|
||||
|
@ -1609,7 +1633,7 @@ static struct bpf_reference_state *find_lock_state(struct bpf_verifier_state *st
|
|||
for (i = 0; i < state->acquired_refs; i++) {
|
||||
struct bpf_reference_state *s = &state->refs[i];
|
||||
|
||||
if (s->type != type)
|
||||
if (!(s->type & type))
|
||||
continue;
|
||||
|
||||
if (s->id == id && s->ptr == ptr)
|
||||
|
@ -8204,6 +8228,12 @@ static int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg
|
|||
return err;
|
||||
}
|
||||
|
||||
enum {
|
||||
PROCESS_SPIN_LOCK = (1 << 0),
|
||||
PROCESS_RES_LOCK = (1 << 1),
|
||||
PROCESS_LOCK_IRQ = (1 << 2),
|
||||
};
|
||||
|
||||
/* Implementation details:
|
||||
* bpf_map_lookup returns PTR_TO_MAP_VALUE_OR_NULL.
|
||||
* bpf_obj_new returns PTR_TO_BTF_ID | MEM_ALLOC | PTR_MAYBE_NULL.
|
||||
|
@ -8226,30 +8256,33 @@ static int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg
|
|||
* env->cur_state->active_locks remembers which map value element or allocated
|
||||
* object got locked and clears it after bpf_spin_unlock.
|
||||
*/
|
||||
static int process_spin_lock(struct bpf_verifier_env *env, int regno,
|
||||
bool is_lock)
|
||||
static int process_spin_lock(struct bpf_verifier_env *env, int regno, int flags)
|
||||
{
|
||||
bool is_lock = flags & PROCESS_SPIN_LOCK, is_res_lock = flags & PROCESS_RES_LOCK;
|
||||
const char *lock_str = is_res_lock ? "bpf_res_spin" : "bpf_spin";
|
||||
struct bpf_reg_state *regs = cur_regs(env), *reg = ®s[regno];
|
||||
struct bpf_verifier_state *cur = env->cur_state;
|
||||
bool is_const = tnum_is_const(reg->var_off);
|
||||
bool is_irq = flags & PROCESS_LOCK_IRQ;
|
||||
u64 val = reg->var_off.value;
|
||||
struct bpf_map *map = NULL;
|
||||
struct btf *btf = NULL;
|
||||
struct btf_record *rec;
|
||||
u32 spin_lock_off;
|
||||
int err;
|
||||
|
||||
if (!is_const) {
|
||||
verbose(env,
|
||||
"R%d doesn't have constant offset. bpf_spin_lock has to be at the constant offset\n",
|
||||
regno);
|
||||
"R%d doesn't have constant offset. %s_lock has to be at the constant offset\n",
|
||||
regno, lock_str);
|
||||
return -EINVAL;
|
||||
}
|
||||
if (reg->type == PTR_TO_MAP_VALUE) {
|
||||
map = reg->map_ptr;
|
||||
if (!map->btf) {
|
||||
verbose(env,
|
||||
"map '%s' has to have BTF in order to use bpf_spin_lock\n",
|
||||
map->name);
|
||||
"map '%s' has to have BTF in order to use %s_lock\n",
|
||||
map->name, lock_str);
|
||||
return -EINVAL;
|
||||
}
|
||||
} else {
|
||||
|
@ -8257,36 +8290,53 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
|
|||
}
|
||||
|
||||
rec = reg_btf_record(reg);
|
||||
if (!btf_record_has_field(rec, BPF_SPIN_LOCK)) {
|
||||
verbose(env, "%s '%s' has no valid bpf_spin_lock\n", map ? "map" : "local",
|
||||
map ? map->name : "kptr");
|
||||
if (!btf_record_has_field(rec, is_res_lock ? BPF_RES_SPIN_LOCK : BPF_SPIN_LOCK)) {
|
||||
verbose(env, "%s '%s' has no valid %s_lock\n", map ? "map" : "local",
|
||||
map ? map->name : "kptr", lock_str);
|
||||
return -EINVAL;
|
||||
}
|
||||
if (rec->spin_lock_off != val + reg->off) {
|
||||
verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock' that is at %d\n",
|
||||
val + reg->off, rec->spin_lock_off);
|
||||
spin_lock_off = is_res_lock ? rec->res_spin_lock_off : rec->spin_lock_off;
|
||||
if (spin_lock_off != val + reg->off) {
|
||||
verbose(env, "off %lld doesn't point to 'struct %s_lock' that is at %d\n",
|
||||
val + reg->off, lock_str, spin_lock_off);
|
||||
return -EINVAL;
|
||||
}
|
||||
if (is_lock) {
|
||||
void *ptr;
|
||||
int type;
|
||||
|
||||
if (map)
|
||||
ptr = map;
|
||||
else
|
||||
ptr = btf;
|
||||
|
||||
if (cur->active_locks) {
|
||||
verbose(env,
|
||||
"Locking two bpf_spin_locks are not allowed\n");
|
||||
return -EINVAL;
|
||||
if (!is_res_lock && cur->active_locks) {
|
||||
if (find_lock_state(env->cur_state, REF_TYPE_LOCK, 0, NULL)) {
|
||||
verbose(env,
|
||||
"Locking two bpf_spin_locks are not allowed\n");
|
||||
return -EINVAL;
|
||||
}
|
||||
} else if (is_res_lock && cur->active_locks) {
|
||||
if (find_lock_state(env->cur_state, REF_TYPE_RES_LOCK | REF_TYPE_RES_LOCK_IRQ, reg->id, ptr)) {
|
||||
verbose(env, "Acquiring the same lock again, AA deadlock detected\n");
|
||||
return -EINVAL;
|
||||
}
|
||||
}
|
||||
err = acquire_lock_state(env, env->insn_idx, REF_TYPE_LOCK, reg->id, ptr);
|
||||
|
||||
if (is_res_lock && is_irq)
|
||||
type = REF_TYPE_RES_LOCK_IRQ;
|
||||
else if (is_res_lock)
|
||||
type = REF_TYPE_RES_LOCK;
|
||||
else
|
||||
type = REF_TYPE_LOCK;
|
||||
err = acquire_lock_state(env, env->insn_idx, type, reg->id, ptr);
|
||||
if (err < 0) {
|
||||
verbose(env, "Failed to acquire lock state\n");
|
||||
return err;
|
||||
}
|
||||
} else {
|
||||
void *ptr;
|
||||
int type;
|
||||
|
||||
if (map)
|
||||
ptr = map;
|
||||
|
@ -8294,12 +8344,26 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
|
|||
ptr = btf;
|
||||
|
||||
		if (!cur->active_locks) {
-			verbose(env, "bpf_spin_unlock without taking a lock\n");
+			verbose(env, "%s_unlock without taking a lock\n", lock_str);
			return -EINVAL;
		}

-		if (release_lock_state(env->cur_state, REF_TYPE_LOCK, reg->id, ptr)) {
-			verbose(env, "bpf_spin_unlock of different lock\n");
+		if (is_res_lock && is_irq)
+			type = REF_TYPE_RES_LOCK_IRQ;
+		else if (is_res_lock)
+			type = REF_TYPE_RES_LOCK;
+		else
+			type = REF_TYPE_LOCK;
+		if (!find_lock_state(cur, type, reg->id, ptr)) {
+			verbose(env, "%s_unlock of different lock\n", lock_str);
+			return -EINVAL;
+		}
+		if (reg->id != cur->active_lock_id || ptr != cur->active_lock_ptr) {
+			verbose(env, "%s_unlock cannot be out of order\n", lock_str);
+			return -EINVAL;
+		}
+		if (release_lock_state(cur, type, reg->id, ptr)) {
+			verbose(env, "%s_unlock of different lock\n", lock_str);
			return -EINVAL;
		}
|
||||
|
||||
|
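For resilient locks the active_lock_id/active_lock_ptr check above also rejects out-of-order releases. A minimal BPF-side sketch of the accepted pattern (not part of the patch; lockA and lockB stand in for any two bpf_res_spin_lock instances and mirror the selftests further down):

	/* Nested resilient locks must be released in reverse acquisition
	 * order, otherwise the verifier reports
	 * "bpf_res_spin_unlock cannot be out of order".
	 */
	if (bpf_res_spin_lock(&lockA))
		return 0;
	if (bpf_res_spin_lock(&lockB)) {
		bpf_res_spin_unlock(&lockA);
		return 0;
	}
	/* ... critical section ... */
	bpf_res_spin_unlock(&lockB);	/* innermost lock first */
	bpf_res_spin_unlock(&lockA);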
@ -9625,11 +9689,11 @@ skip_type_check:
|
|||
return -EACCES;
|
||||
}
|
||||
	if (meta->func_id == BPF_FUNC_spin_lock) {
-		err = process_spin_lock(env, regno, true);
+		err = process_spin_lock(env, regno, PROCESS_SPIN_LOCK);
		if (err)
			return err;
	} else if (meta->func_id == BPF_FUNC_spin_unlock) {
-		err = process_spin_lock(env, regno, false);
+		err = process_spin_lock(env, regno, 0);
|
||||
if (err)
|
||||
return err;
|
||||
} else {
|
||||
|
@ -11511,7 +11575,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
|
|||
regs[BPF_REG_0].map_uid = meta.map_uid;
|
||||
regs[BPF_REG_0].type = PTR_TO_MAP_VALUE | ret_flag;
|
||||
		if (!type_may_be_null(ret_flag) &&
-		    btf_record_has_field(meta.map_ptr->record, BPF_SPIN_LOCK)) {
+		    btf_record_has_field(meta.map_ptr->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
|
||||
regs[BPF_REG_0].id = ++env->id_gen;
|
||||
}
|
||||
break;
|
||||
|
@ -11683,10 +11747,10 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
|
|||
/* mark_btf_func_reg_size() is used when the reg size is determined by
|
||||
* the BTF func_proto's return value size and argument.
|
||||
*/
|
||||
-static void mark_btf_func_reg_size(struct bpf_verifier_env *env, u32 regno,
-				   size_t reg_size)
+static void __mark_btf_func_reg_size(struct bpf_verifier_env *env, struct bpf_reg_state *regs,
+				     u32 regno, size_t reg_size)
{
-	struct bpf_reg_state *reg = &cur_regs(env)[regno];
+	struct bpf_reg_state *reg = &regs[regno];
|
||||
|
||||
if (regno == BPF_REG_0) {
|
||||
/* Function return value */
|
||||
|
@ -11704,6 +11768,12 @@ static void mark_btf_func_reg_size(struct bpf_verifier_env *env, u32 regno,
|
|||
}
|
||||
}
|
||||
|
||||
static void mark_btf_func_reg_size(struct bpf_verifier_env *env, u32 regno,
|
||||
size_t reg_size)
|
||||
{
|
||||
return __mark_btf_func_reg_size(env, cur_regs(env), regno, reg_size);
|
||||
}
|
||||
|
||||
static bool is_kfunc_acquire(struct bpf_kfunc_call_arg_meta *meta)
|
||||
{
|
||||
return meta->kfunc_flags & KF_ACQUIRE;
|
||||
|
@ -11841,6 +11911,7 @@ enum {
|
|||
KF_ARG_RB_ROOT_ID,
|
||||
KF_ARG_RB_NODE_ID,
|
||||
KF_ARG_WORKQUEUE_ID,
|
||||
KF_ARG_RES_SPIN_LOCK_ID,
|
||||
};
|
||||
|
||||
BTF_ID_LIST(kf_arg_btf_ids)
|
||||
|
@ -11850,6 +11921,7 @@ BTF_ID(struct, bpf_list_node)
|
|||
BTF_ID(struct, bpf_rb_root)
|
||||
BTF_ID(struct, bpf_rb_node)
|
||||
BTF_ID(struct, bpf_wq)
|
||||
BTF_ID(struct, bpf_res_spin_lock)
|
||||
|
||||
static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
|
||||
const struct btf_param *arg, int type)
|
||||
|
@ -11898,6 +11970,11 @@ static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg)
|
|||
return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_WORKQUEUE_ID);
|
||||
}
|
||||
|
||||
static bool is_kfunc_arg_res_spin_lock(const struct btf *btf, const struct btf_param *arg)
|
||||
{
|
||||
return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RES_SPIN_LOCK_ID);
|
||||
}
|
||||
|
||||
static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
|
||||
const struct btf_param *arg)
|
||||
{
|
||||
|
@ -11969,6 +12046,7 @@ enum kfunc_ptr_arg_type {
|
|||
KF_ARG_PTR_TO_MAP,
|
||||
KF_ARG_PTR_TO_WORKQUEUE,
|
||||
KF_ARG_PTR_TO_IRQ_FLAG,
|
||||
KF_ARG_PTR_TO_RES_SPIN_LOCK,
|
||||
};
|
||||
|
||||
enum special_kfunc_type {
|
||||
|
@ -12007,6 +12085,10 @@ enum special_kfunc_type {
|
|||
KF_bpf_iter_num_destroy,
|
||||
KF_bpf_set_dentry_xattr,
|
||||
KF_bpf_remove_dentry_xattr,
|
||||
KF_bpf_res_spin_lock,
|
||||
KF_bpf_res_spin_unlock,
|
||||
KF_bpf_res_spin_lock_irqsave,
|
||||
KF_bpf_res_spin_unlock_irqrestore,
|
||||
};
|
||||
|
||||
BTF_SET_START(special_kfunc_set)
|
||||
|
@ -12096,6 +12178,10 @@ BTF_ID(func, bpf_remove_dentry_xattr)
|
|||
BTF_ID_UNUSED
|
||||
BTF_ID_UNUSED
|
||||
#endif
|
||||
BTF_ID(func, bpf_res_spin_lock)
|
||||
BTF_ID(func, bpf_res_spin_unlock)
|
||||
BTF_ID(func, bpf_res_spin_lock_irqsave)
|
||||
BTF_ID(func, bpf_res_spin_unlock_irqrestore)
|
||||
|
||||
static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
|
||||
{
|
||||
|
@ -12189,6 +12275,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
|
|||
if (is_kfunc_arg_irq_flag(meta->btf, &args[argno]))
|
||||
return KF_ARG_PTR_TO_IRQ_FLAG;
|
||||
|
||||
if (is_kfunc_arg_res_spin_lock(meta->btf, &args[argno]))
|
||||
return KF_ARG_PTR_TO_RES_SPIN_LOCK;
|
||||
|
||||
if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
|
||||
if (!btf_type_is_struct(ref_t)) {
|
||||
verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
|
||||
|
@ -12296,13 +12385,19 @@ static int process_irq_flag(struct bpf_verifier_env *env, int regno,
|
|||
struct bpf_kfunc_call_arg_meta *meta)
|
||||
{
|
||||
	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
|
||||
+	int err, kfunc_class = IRQ_NATIVE_KFUNC;
	bool irq_save;
-	int err;
|
||||
|
||||
-	if (meta->func_id == special_kfunc_list[KF_bpf_local_irq_save]) {
+	if (meta->func_id == special_kfunc_list[KF_bpf_local_irq_save] ||
+	    meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave]) {
		irq_save = true;
-	} else if (meta->func_id == special_kfunc_list[KF_bpf_local_irq_restore]) {
+		if (meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave])
+			kfunc_class = IRQ_LOCK_KFUNC;
+	} else if (meta->func_id == special_kfunc_list[KF_bpf_local_irq_restore] ||
+		   meta->func_id == special_kfunc_list[KF_bpf_res_spin_unlock_irqrestore]) {
		irq_save = false;
+		if (meta->func_id == special_kfunc_list[KF_bpf_res_spin_unlock_irqrestore])
+			kfunc_class = IRQ_LOCK_KFUNC;
	} else {
|
||||
verbose(env, "verifier internal error: unknown irq flags kfunc\n");
|
||||
return -EFAULT;
|
||||
|
@ -12318,7 +12413,7 @@ static int process_irq_flag(struct bpf_verifier_env *env, int regno,
|
|||
if (err)
|
||||
return err;
|
||||
|
||||
-		err = mark_stack_slot_irq_flag(env, meta, reg, env->insn_idx);
+		err = mark_stack_slot_irq_flag(env, meta, reg, env->insn_idx, kfunc_class);
|
||||
if (err)
|
||||
return err;
|
||||
} else {
|
||||
|
@ -12332,7 +12427,7 @@ static int process_irq_flag(struct bpf_verifier_env *env, int regno,
|
|||
if (err)
|
||||
return err;
|
||||
|
||||
-		err = unmark_stack_slot_irq_flag(env, reg);
+		err = unmark_stack_slot_irq_flag(env, reg, kfunc_class);
|
||||
if (err)
|
||||
return err;
|
||||
}
|
||||
|
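The IRQ-saving lock kfuncs reuse the irq flag plumbing above but are tagged with kfunc_class == IRQ_LOCK_KFUNC, so the saved flags cookie can only be consumed by the matching unlock kfunc. A hedged usage sketch (lock is assumed to be a global struct bpf_res_spin_lock; the pattern mirrors the selftests below, it is not taken from this hunk):

	unsigned long flags;

	if (bpf_res_spin_lock_irqsave(&lock, &flags))
		return 0;	/* not acquired; no saved flags to restore */
	/* ... critical section with IRQs disabled ... */
	bpf_res_spin_unlock_irqrestore(&lock, &flags);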
@ -12459,7 +12554,7 @@ static int check_reg_allocation_locked(struct bpf_verifier_env *env, struct bpf_
|
|||
|
||||
if (!env->cur_state->active_locks)
|
||||
return -EINVAL;
|
||||
-	s = find_lock_state(env->cur_state, REF_TYPE_LOCK, id, ptr);
+	s = find_lock_state(env->cur_state, REF_TYPE_LOCK_MASK, id, ptr);
|
||||
if (!s) {
|
||||
verbose(env, "held lock and object are not in the same allocation\n");
|
||||
return -EINVAL;
|
||||
|
@ -12495,9 +12590,18 @@ static bool is_bpf_graph_api_kfunc(u32 btf_id)
|
|||
btf_id == special_kfunc_list[KF_bpf_refcount_acquire_impl];
|
||||
}
|
||||
|
||||
static bool is_bpf_res_spin_lock_kfunc(u32 btf_id)
|
||||
{
|
||||
return btf_id == special_kfunc_list[KF_bpf_res_spin_lock] ||
|
||||
btf_id == special_kfunc_list[KF_bpf_res_spin_unlock] ||
|
||||
btf_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave] ||
|
||||
btf_id == special_kfunc_list[KF_bpf_res_spin_unlock_irqrestore];
|
||||
}
|
||||
|
||||
static bool kfunc_spin_allowed(u32 btf_id)
|
||||
{
|
||||
-	return is_bpf_graph_api_kfunc(btf_id) || is_bpf_iter_num_api_kfunc(btf_id);
+	return is_bpf_graph_api_kfunc(btf_id) || is_bpf_iter_num_api_kfunc(btf_id) ||
+	       is_bpf_res_spin_lock_kfunc(btf_id);
|
||||
}
|
||||
|
||||
static bool is_sync_callback_calling_kfunc(u32 btf_id)
|
||||
|
@ -12929,6 +13033,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
|
|||
case KF_ARG_PTR_TO_CONST_STR:
|
||||
case KF_ARG_PTR_TO_WORKQUEUE:
|
||||
case KF_ARG_PTR_TO_IRQ_FLAG:
|
||||
case KF_ARG_PTR_TO_RES_SPIN_LOCK:
|
||||
break;
|
||||
default:
|
||||
WARN_ON_ONCE(1);
|
||||
|
@ -13227,6 +13332,28 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
|
|||
if (ret < 0)
|
||||
return ret;
|
||||
break;
|
||||
case KF_ARG_PTR_TO_RES_SPIN_LOCK:
|
||||
{
|
||||
int flags = PROCESS_RES_LOCK;
|
||||
|
||||
if (reg->type != PTR_TO_MAP_VALUE && reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
|
||||
verbose(env, "arg#%d doesn't point to map value or allocated object\n", i);
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
if (!is_bpf_res_spin_lock_kfunc(meta->func_id))
|
||||
return -EFAULT;
|
||||
if (meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock] ||
|
||||
meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave])
|
||||
flags |= PROCESS_SPIN_LOCK;
|
||||
if (meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave] ||
|
||||
meta->func_id == special_kfunc_list[KF_bpf_res_spin_unlock_irqrestore])
|
||||
flags |= PROCESS_LOCK_IRQ;
|
||||
ret = process_spin_lock(env, regno, flags);
|
||||
if (ret < 0)
|
||||
return ret;
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
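To summarize the branches in the KF_ARG_PTR_TO_RES_SPIN_LOCK case above, the flags handed to process_spin_lock() work out as follows (derived from this hunk, shown here only as a reading aid):

	/*
	 * bpf_res_spin_lock              -> PROCESS_RES_LOCK | PROCESS_SPIN_LOCK
	 * bpf_res_spin_lock_irqsave      -> PROCESS_RES_LOCK | PROCESS_SPIN_LOCK | PROCESS_LOCK_IRQ
	 * bpf_res_spin_unlock            -> PROCESS_RES_LOCK
	 * bpf_res_spin_unlock_irqrestore -> PROCESS_RES_LOCK | PROCESS_LOCK_IRQ
	 */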
|
||||
|
||||
|
@ -13312,6 +13439,33 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
|
|||
|
||||
insn_aux->is_iter_next = is_iter_next_kfunc(&meta);
|
||||
|
||||
if (!insn->off &&
|
||||
(insn->imm == special_kfunc_list[KF_bpf_res_spin_lock] ||
|
||||
insn->imm == special_kfunc_list[KF_bpf_res_spin_lock_irqsave])) {
|
||||
struct bpf_verifier_state *branch;
|
||||
struct bpf_reg_state *regs;
|
||||
|
||||
branch = push_stack(env, env->insn_idx + 1, env->insn_idx, false);
|
||||
if (!branch) {
|
||||
verbose(env, "failed to push state for failed lock acquisition\n");
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
regs = branch->frame[branch->curframe]->regs;
|
||||
|
||||
/* Clear r0-r5 registers in forked state */
|
||||
for (i = 0; i < CALLER_SAVED_REGS; i++)
|
||||
mark_reg_not_init(env, regs, caller_saved[i]);
|
||||
|
||||
mark_reg_unknown(env, regs, BPF_REG_0);
|
||||
err = __mark_reg_s32_range(env, regs, BPF_REG_0, -MAX_ERRNO, -1);
|
||||
if (err) {
|
||||
verbose(env, "failed to mark s32 range for retval in forked state for lock\n");
|
||||
return err;
|
||||
}
|
||||
__mark_btf_func_reg_size(env, regs, BPF_REG_0, sizeof(u32));
|
||||
}
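Because the call is forked here into a success state (R0 marked constant zero further down) and a failure state (R0 constrained to [-MAX_ERRNO, -1]), a program can only reach the critical section on the path where the return value has been checked. A rough sketch of the pattern the verifier ends up enforcing (elem is assumed to be a map value embedding a bpf_res_spin_lock; illustrative, not taken from the patch):

	int r;

	r = bpf_res_spin_lock(&elem->lock);
	if (r)		/* failure branch: lock was not acquired */
		return r;
	/* success branch: lock held, critical section goes here */
	bpf_res_spin_unlock(&elem->lock);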
|
||||
|
||||
if (is_kfunc_destructive(&meta) && !capable(CAP_SYS_BOOT)) {
|
||||
verbose(env, "destructive kfunc calls require CAP_SYS_BOOT capability\n");
|
||||
return -EACCES;
|
||||
|
@ -13482,6 +13636,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
|
|||
|
||||
if (btf_type_is_scalar(t)) {
|
||||
mark_reg_unknown(env, regs, BPF_REG_0);
|
||||
if (meta.btf == btf_vmlinux && (meta.func_id == special_kfunc_list[KF_bpf_res_spin_lock] ||
|
||||
meta.func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave]))
|
||||
			__mark_reg_const_zero(env, &regs[BPF_REG_0]);
|
||||
mark_btf_func_reg_size(env, BPF_REG_0, t->size);
|
||||
} else if (btf_type_is_ptr(t)) {
|
||||
ptr_type = btf_type_skip_modifiers(desc_btf, t->type, &ptr_type_id);
|
||||
|
@ -18417,7 +18574,8 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
|
|||
case STACK_IRQ_FLAG:
|
||||
old_reg = &old->stack[spi].spilled_ptr;
|
||||
cur_reg = &cur->stack[spi].spilled_ptr;
|
||||
-			if (!check_ids(old_reg->ref_obj_id, cur_reg->ref_obj_id, idmap))
+			if (!check_ids(old_reg->ref_obj_id, cur_reg->ref_obj_id, idmap) ||
+			    old_reg->irq.kfunc_class != cur_reg->irq.kfunc_class)
|
||||
return false;
|
||||
break;
|
||||
case STACK_MISC:
|
||||
|
@ -18452,6 +18610,10 @@ static bool refsafe(struct bpf_verifier_state *old, struct bpf_verifier_state *c
|
|||
if (!check_ids(old->active_irq_id, cur->active_irq_id, idmap))
|
||||
return false;
|
||||
|
||||
if (!check_ids(old->active_lock_id, cur->active_lock_id, idmap) ||
|
||||
old->active_lock_ptr != cur->active_lock_ptr)
|
||||
return false;
|
||||
|
||||
for (i = 0; i < old->acquired_refs; i++) {
|
||||
if (!check_ids(old->refs[i].id, cur->refs[i].id, idmap) ||
|
||||
old->refs[i].type != cur->refs[i].type)
|
||||
|
@ -18461,6 +18623,8 @@ static bool refsafe(struct bpf_verifier_state *old, struct bpf_verifier_state *c
|
|||
case REF_TYPE_IRQ:
|
||||
break;
|
||||
case REF_TYPE_LOCK:
|
||||
case REF_TYPE_RES_LOCK:
|
||||
case REF_TYPE_RES_LOCK_IRQ:
|
||||
if (old->refs[i].ptr != cur->refs[i].ptr)
|
||||
return false;
|
||||
break;
|
||||
|
@ -19746,7 +19910,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
|
|||
}
|
||||
}
|
||||
|
||||
-	if (btf_record_has_field(map->record, BPF_SPIN_LOCK)) {
+	if (btf_record_has_field(map->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
|
||||
if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
|
||||
verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
|
||||
return -EINVAL;
|
||||
|
|
|
@ -49,6 +49,11 @@ LOCK_EVENT(lock_use_node4) /* # of locking ops that use 4th percpu node */
|
|||
LOCK_EVENT(lock_no_node) /* # of locking ops w/o using percpu node */
|
||||
#endif /* CONFIG_QUEUED_SPINLOCKS */
|
||||
|
||||
/*
|
||||
* Locking events for Resilient Queued Spin Lock
|
||||
*/
|
||||
LOCK_EVENT(rqspinlock_lock_timeout) /* # of locking ops that timeout */
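Reading aid for the new counter (an assumption about the existing lock-event plumbing, not something this hunk adds):

/*
 * With CONFIG_LOCK_EVENT_COUNTS=y, each LOCK_EVENT() entry is exported as a
 * debugfs counter, so the timeout count would be expected to appear as
 * /sys/kernel/debug/lockevent/rqspinlock_lock_timeout alongside the existing
 * qspinlock event counts.
 */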
|
||||
|
||||
/*
|
||||
* Locking events for rwsem
|
||||
*/
|
||||
|
|
|
@ -362,6 +362,60 @@ static struct lock_torture_ops raw_spin_lock_irq_ops = {
|
|||
.name = "raw_spin_lock_irq"
|
||||
};
|
||||
|
||||
#ifdef CONFIG_BPF_SYSCALL
|
||||
|
||||
#include <asm/rqspinlock.h>
|
||||
static rqspinlock_t rqspinlock;
|
||||
|
||||
static int torture_raw_res_spin_write_lock(int tid __maybe_unused)
|
||||
{
|
||||
raw_res_spin_lock(&rqspinlock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void torture_raw_res_spin_write_unlock(int tid __maybe_unused)
|
||||
{
|
||||
raw_res_spin_unlock(&rqspinlock);
|
||||
}
|
||||
|
||||
static struct lock_torture_ops raw_res_spin_lock_ops = {
|
||||
.writelock = torture_raw_res_spin_write_lock,
|
||||
.write_delay = torture_spin_lock_write_delay,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_raw_res_spin_write_unlock,
|
||||
.readlock = NULL,
|
||||
.read_delay = NULL,
|
||||
.readunlock = NULL,
|
||||
.name = "raw_res_spin_lock"
|
||||
};
|
||||
|
||||
static int torture_raw_res_spin_write_lock_irq(int tid __maybe_unused)
|
||||
{
|
||||
unsigned long flags;
|
||||
|
||||
raw_res_spin_lock_irqsave(&rqspinlock, flags);
|
||||
cxt.cur_ops->flags = flags;
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void torture_raw_res_spin_write_unlock_irq(int tid __maybe_unused)
|
||||
{
|
||||
raw_res_spin_unlock_irqrestore(&rqspinlock, cxt.cur_ops->flags);
|
||||
}
|
||||
|
||||
static struct lock_torture_ops raw_res_spin_lock_irq_ops = {
|
||||
.writelock = torture_raw_res_spin_write_lock_irq,
|
||||
.write_delay = torture_spin_lock_write_delay,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_raw_res_spin_write_unlock_irq,
|
||||
.readlock = NULL,
|
||||
.read_delay = NULL,
|
||||
.readunlock = NULL,
|
||||
.name = "raw_res_spin_lock_irq"
|
||||
};
|
||||
|
||||
#endif
|
||||
|
||||
static DEFINE_RWLOCK(torture_rwlock);
|
||||
|
||||
static int torture_rwlock_write_lock(int tid __maybe_unused)
|
||||
|
@ -1168,6 +1222,9 @@ static int __init lock_torture_init(void)
|
|||
&lock_busted_ops,
|
||||
&spin_lock_ops, &spin_lock_irq_ops,
|
||||
&raw_spin_lock_ops, &raw_spin_lock_irq_ops,
|
||||
#ifdef CONFIG_BPF_SYSCALL
|
||||
&raw_res_spin_lock_ops, &raw_res_spin_lock_irq_ops,
|
||||
#endif
|
||||
&rw_lock_ops, &rw_lock_irq_ops,
|
||||
&mutex_lock_ops,
|
||||
&ww_mutex_lock_ops,
|
||||
|
|
|
@ -15,12 +15,6 @@
|
|||
|
||||
#include <asm/mcs_spinlock.h>
|
||||
|
||||
struct mcs_spinlock {
|
||||
struct mcs_spinlock *next;
|
||||
int locked; /* 1 if lock acquired */
|
||||
int count; /* nesting count, see qspinlock.c */
|
||||
};
|
||||
|
||||
#ifndef arch_mcs_spin_lock_contended
|
||||
/*
|
||||
* Using smp_cond_load_acquire() provides the acquire semantics
|
||||
|
@ -30,9 +24,7 @@ struct mcs_spinlock {
|
|||
* spinning, and smp_cond_load_acquire() provides that behavior.
|
||||
*/
|
||||
#define arch_mcs_spin_lock_contended(l)	\
-do {					\
-	smp_cond_load_acquire(l, VAL);	\
-} while (0)
+	smp_cond_load_acquire(l, VAL)
|
||||
#endif
|
||||
|
||||
#ifndef arch_mcs_spin_unlock_contended
|
||||
|
|
|
@ -25,8 +25,9 @@
|
|||
#include <trace/events/lock.h>
|
||||
|
||||
/*
|
||||
- * Include queued spinlock statistics code
+ * Include queued spinlock definitions and statistics code
 */
+#include "qspinlock.h"
#include "qspinlock_stat.h"
|
||||
|
||||
/*
|
||||
|
@ -67,36 +68,6 @@
|
|||
*/
|
||||
|
||||
#include "mcs_spinlock.h"
|
||||
#define MAX_NODES 4
|
||||
|
||||
/*
|
||||
* On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
|
||||
* size and four of them will fit nicely in one 64-byte cacheline. For
|
||||
* pvqspinlock, however, we need more space for extra data. To accommodate
|
||||
* that, we insert two more long words to pad it up to 32 bytes. IOW, only
|
||||
* two of them can fit in a cacheline in this case. That is OK as it is rare
|
||||
* to have more than 2 levels of slowpath nesting in actual use. We don't
|
||||
* want to penalize pvqspinlocks to optimize for a rare case in native
|
||||
* qspinlocks.
|
||||
*/
|
||||
struct qnode {
|
||||
struct mcs_spinlock mcs;
|
||||
#ifdef CONFIG_PARAVIRT_SPINLOCKS
|
||||
long reserved[2];
|
||||
#endif
|
||||
};
|
||||
|
||||
/*
|
||||
* The pending bit spinning loop count.
|
||||
* This heuristic is used to limit the number of lockword accesses
|
||||
* made by atomic_cond_read_relaxed when waiting for the lock to
|
||||
* transition out of the "== _Q_PENDING_VAL" state. We don't spin
|
||||
* indefinitely because there's no guarantee that we'll make forward
|
||||
* progress.
|
||||
*/
|
||||
#ifndef _Q_PENDING_LOOPS
|
||||
#define _Q_PENDING_LOOPS 1
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Per-CPU queue node structures; we can never have more than 4 nested
|
||||
|
@ -106,161 +77,7 @@ struct qnode {
|
|||
*
|
||||
* PV doubles the storage and uses the second cacheline for PV state.
|
||||
*/
|
||||
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
|
||||
|
||||
/*
|
||||
* We must be able to distinguish between no-tail and the tail at 0:0,
|
||||
* therefore increment the cpu number by one.
|
||||
*/
|
||||
|
||||
static inline __pure u32 encode_tail(int cpu, int idx)
|
||||
{
|
||||
u32 tail;
|
||||
|
||||
tail = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
|
||||
tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
|
||||
|
||||
return tail;
|
||||
}
|
||||
|
||||
static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
|
||||
{
|
||||
int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
|
||||
int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
|
||||
|
||||
return per_cpu_ptr(&qnodes[idx].mcs, cpu);
|
||||
}
|
||||
|
||||
static inline __pure
|
||||
struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
|
||||
{
|
||||
return &((struct qnode *)base + idx)->mcs;
|
||||
}
|
||||
|
||||
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
|
||||
|
||||
#if _Q_PENDING_BITS == 8
|
||||
/**
|
||||
* clear_pending - clear the pending bit.
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,1,* -> *,0,*
|
||||
*/
|
||||
static __always_inline void clear_pending(struct qspinlock *lock)
|
||||
{
|
||||
WRITE_ONCE(lock->pending, 0);
|
||||
}
|
||||
|
||||
/**
|
||||
* clear_pending_set_locked - take ownership and clear the pending bit.
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,1,0 -> *,0,1
|
||||
*
|
||||
* Lock stealing is not allowed if this function is used.
|
||||
*/
|
||||
static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
|
||||
{
|
||||
WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
|
||||
}
|
||||
|
||||
/*
|
||||
* xchg_tail - Put in the new queue tail code word & retrieve previous one
|
||||
* @lock : Pointer to queued spinlock structure
|
||||
* @tail : The new queue tail code word
|
||||
* Return: The previous queue tail code word
|
||||
*
|
||||
* xchg(lock, tail), which heads an address dependency
|
||||
*
|
||||
* p,*,* -> n,*,* ; prev = xchg(lock, node)
|
||||
*/
|
||||
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
|
||||
{
|
||||
/*
|
||||
* We can use relaxed semantics since the caller ensures that the
|
||||
* MCS node is properly initialized before updating the tail.
|
||||
*/
|
||||
return (u32)xchg_relaxed(&lock->tail,
|
||||
tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
|
||||
}
|
||||
|
||||
#else /* _Q_PENDING_BITS == 8 */
|
||||
|
||||
/**
|
||||
* clear_pending - clear the pending bit.
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,1,* -> *,0,*
|
||||
*/
|
||||
static __always_inline void clear_pending(struct qspinlock *lock)
|
||||
{
|
||||
atomic_andnot(_Q_PENDING_VAL, &lock->val);
|
||||
}
|
||||
|
||||
/**
|
||||
* clear_pending_set_locked - take ownership and clear the pending bit.
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,1,0 -> *,0,1
|
||||
*/
|
||||
static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
|
||||
{
|
||||
atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
|
||||
}
|
||||
|
||||
/**
|
||||
* xchg_tail - Put in the new queue tail code word & retrieve previous one
|
||||
* @lock : Pointer to queued spinlock structure
|
||||
* @tail : The new queue tail code word
|
||||
* Return: The previous queue tail code word
|
||||
*
|
||||
* xchg(lock, tail)
|
||||
*
|
||||
* p,*,* -> n,*,* ; prev = xchg(lock, node)
|
||||
*/
|
||||
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
|
||||
{
|
||||
u32 old, new;
|
||||
|
||||
old = atomic_read(&lock->val);
|
||||
do {
|
||||
new = (old & _Q_LOCKED_PENDING_MASK) | tail;
|
||||
/*
|
||||
* We can use relaxed semantics since the caller ensures that
|
||||
* the MCS node is properly initialized before updating the
|
||||
* tail.
|
||||
*/
|
||||
} while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
|
||||
|
||||
return old;
|
||||
}
|
||||
#endif /* _Q_PENDING_BITS == 8 */
|
||||
|
||||
/**
|
||||
* queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
|
||||
* @lock : Pointer to queued spinlock structure
|
||||
* Return: The previous lock value
|
||||
*
|
||||
* *,*,* -> *,1,*
|
||||
*/
|
||||
#ifndef queued_fetch_set_pending_acquire
|
||||
static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
|
||||
{
|
||||
return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
|
||||
}
|
||||
#endif
|
||||
|
||||
/**
|
||||
* set_locked - Set the lock bit and own the lock
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,*,0 -> *,0,1
|
||||
*/
|
||||
static __always_inline void set_locked(struct qspinlock *lock)
|
||||
{
|
||||
WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
|
||||
}
|
||||
|
||||
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
|
||||
|
||||
/*
|
||||
* Generate the native code for queued_spin_unlock_slowpath(); provide NOPs for
|
||||
|
@ -410,7 +227,7 @@ pv_queue:
|
|||
* any MCS node. This is not the most elegant solution, but is
|
||||
* simple enough.
|
||||
*/
|
||||
-	if (unlikely(idx >= MAX_NODES)) {
+	if (unlikely(idx >= _Q_MAX_NODES)) {
|
||||
lockevent_inc(lock_no_node);
|
||||
while (!queued_spin_trylock(lock))
|
||||
cpu_relax();
|
||||
|
@ -465,7 +282,7 @@ pv_queue:
|
|||
* head of the waitqueue.
|
||||
*/
|
||||
	if (old & _Q_TAIL_MASK) {
-		prev = decode_tail(old);
+		prev = decode_tail(old, qnodes);
|
||||
|
||||
/* Link @node into the waitqueue. */
|
||||
WRITE_ONCE(prev->next, node);
|
||||
|
|
kernel/locking/qspinlock.h (new file, 201 lines)
|
@ -0,0 +1,201 @@
|
|||
/* SPDX-License-Identifier: GPL-2.0-or-later */
|
||||
/*
|
||||
* Queued spinlock defines
|
||||
*
|
||||
* This file contains macro definitions and functions shared between different
|
||||
* qspinlock slow path implementations.
|
||||
*/
|
||||
#ifndef __LINUX_QSPINLOCK_H
|
||||
#define __LINUX_QSPINLOCK_H
|
||||
|
||||
#include <asm-generic/percpu.h>
|
||||
#include <linux/percpu-defs.h>
|
||||
#include <asm-generic/qspinlock.h>
|
||||
#include <asm-generic/mcs_spinlock.h>
|
||||
|
||||
#define _Q_MAX_NODES 4
|
||||
|
||||
/*
|
||||
* The pending bit spinning loop count.
|
||||
* This heuristic is used to limit the number of lockword accesses
|
||||
* made by atomic_cond_read_relaxed when waiting for the lock to
|
||||
* transition out of the "== _Q_PENDING_VAL" state. We don't spin
|
||||
* indefinitely because there's no guarantee that we'll make forward
|
||||
* progress.
|
||||
*/
|
||||
#ifndef _Q_PENDING_LOOPS
|
||||
#define _Q_PENDING_LOOPS 1
|
||||
#endif
|
||||
|
||||
/*
|
||||
* On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
|
||||
* size and four of them will fit nicely in one 64-byte cacheline. For
|
||||
* pvqspinlock, however, we need more space for extra data. To accommodate
|
||||
* that, we insert two more long words to pad it up to 32 bytes. IOW, only
|
||||
* two of them can fit in a cacheline in this case. That is OK as it is rare
|
||||
* to have more than 2 levels of slowpath nesting in actual use. We don't
|
||||
* want to penalize pvqspinlocks to optimize for a rare case in native
|
||||
* qspinlocks.
|
||||
*/
|
||||
struct qnode {
|
||||
struct mcs_spinlock mcs;
|
||||
#ifdef CONFIG_PARAVIRT_SPINLOCKS
|
||||
long reserved[2];
|
||||
#endif
|
||||
};
|
||||
|
||||
/*
|
||||
* We must be able to distinguish between no-tail and the tail at 0:0,
|
||||
* therefore increment the cpu number by one.
|
||||
*/
|
||||
|
||||
static inline __pure u32 encode_tail(int cpu, int idx)
|
||||
{
|
||||
u32 tail;
|
||||
|
||||
tail = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
|
||||
tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
|
||||
|
||||
return tail;
|
||||
}
|
||||
|
||||
static inline __pure struct mcs_spinlock *decode_tail(u32 tail,
|
||||
struct qnode __percpu *qnodes)
|
||||
{
|
||||
int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
|
||||
int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
|
||||
|
||||
return per_cpu_ptr(&qnodes[idx].mcs, cpu);
|
||||
}
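A small worked example of the tail encoding round-trip (the offsets quoted below are the usual qspinlock layout and are stated here as an assumption, not taken from this file):

/*
 * Example (assuming the common layout _Q_TAIL_IDX_OFFSET == 16 and
 * _Q_TAIL_CPU_OFFSET == 18): encode_tail(cpu = 0, idx = 0) yields
 * (0 + 1) << 18, which is non-zero, so a tail of 0 can always mean
 * "no queued waiters"; decode_tail() subtracts the +1 again to recover
 * cpu 0, idx 0.
 */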
|
||||
|
||||
static inline __pure
|
||||
struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
|
||||
{
|
||||
return &((struct qnode *)base + idx)->mcs;
|
||||
}
|
||||
|
||||
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
|
||||
|
||||
#if _Q_PENDING_BITS == 8
|
||||
/**
|
||||
* clear_pending - clear the pending bit.
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,1,* -> *,0,*
|
||||
*/
|
||||
static __always_inline void clear_pending(struct qspinlock *lock)
|
||||
{
|
||||
WRITE_ONCE(lock->pending, 0);
|
||||
}
|
||||
|
||||
/**
|
||||
* clear_pending_set_locked - take ownership and clear the pending bit.
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,1,0 -> *,0,1
|
||||
*
|
||||
* Lock stealing is not allowed if this function is used.
|
||||
*/
|
||||
static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
|
||||
{
|
||||
WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
|
||||
}
|
||||
|
||||
/*
|
||||
* xchg_tail - Put in the new queue tail code word & retrieve previous one
|
||||
* @lock : Pointer to queued spinlock structure
|
||||
* @tail : The new queue tail code word
|
||||
* Return: The previous queue tail code word
|
||||
*
|
||||
* xchg(lock, tail), which heads an address dependency
|
||||
*
|
||||
* p,*,* -> n,*,* ; prev = xchg(lock, node)
|
||||
*/
|
||||
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
|
||||
{
|
||||
/*
|
||||
* We can use relaxed semantics since the caller ensures that the
|
||||
* MCS node is properly initialized before updating the tail.
|
||||
*/
|
||||
return (u32)xchg_relaxed(&lock->tail,
|
||||
tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
|
||||
}
|
||||
|
||||
#else /* _Q_PENDING_BITS == 8 */
|
||||
|
||||
/**
|
||||
* clear_pending - clear the pending bit.
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,1,* -> *,0,*
|
||||
*/
|
||||
static __always_inline void clear_pending(struct qspinlock *lock)
|
||||
{
|
||||
atomic_andnot(_Q_PENDING_VAL, &lock->val);
|
||||
}
|
||||
|
||||
/**
|
||||
* clear_pending_set_locked - take ownership and clear the pending bit.
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,1,0 -> *,0,1
|
||||
*/
|
||||
static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
|
||||
{
|
||||
atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
|
||||
}
|
||||
|
||||
/**
|
||||
* xchg_tail - Put in the new queue tail code word & retrieve previous one
|
||||
* @lock : Pointer to queued spinlock structure
|
||||
* @tail : The new queue tail code word
|
||||
* Return: The previous queue tail code word
|
||||
*
|
||||
* xchg(lock, tail)
|
||||
*
|
||||
* p,*,* -> n,*,* ; prev = xchg(lock, node)
|
||||
*/
|
||||
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
|
||||
{
|
||||
u32 old, new;
|
||||
|
||||
old = atomic_read(&lock->val);
|
||||
do {
|
||||
new = (old & _Q_LOCKED_PENDING_MASK) | tail;
|
||||
/*
|
||||
* We can use relaxed semantics since the caller ensures that
|
||||
* the MCS node is properly initialized before updating the
|
||||
* tail.
|
||||
*/
|
||||
} while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
|
||||
|
||||
return old;
|
||||
}
|
||||
#endif /* _Q_PENDING_BITS == 8 */
|
||||
|
||||
/**
|
||||
* queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
|
||||
* @lock : Pointer to queued spinlock structure
|
||||
* Return: The previous lock value
|
||||
*
|
||||
* *,*,* -> *,1,*
|
||||
*/
|
||||
#ifndef queued_fetch_set_pending_acquire
|
||||
static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
|
||||
{
|
||||
return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
|
||||
}
|
||||
#endif
|
||||
|
||||
/**
|
||||
* set_locked - Set the lock bit and own the lock
|
||||
* @lock: Pointer to queued spinlock structure
|
||||
*
|
||||
* *,*,0 -> *,0,1
|
||||
*/
|
||||
static __always_inline void set_locked(struct qspinlock *lock)
|
||||
{
|
||||
WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
|
||||
}
|
||||
|
||||
#endif /* __LINUX_QSPINLOCK_H */
|
tools/testing/selftests/bpf/prog_tests/res_spin_lock.c (new file, 98 lines)
|
@ -0,0 +1,98 @@
|
|||
// SPDX-License-Identifier: GPL-2.0
|
||||
/* Copyright (c) 2024-2025 Meta Platforms, Inc. and affiliates. */
|
||||
#include <test_progs.h>
|
||||
#include <network_helpers.h>
|
||||
#include <sys/sysinfo.h>
|
||||
|
||||
#include "res_spin_lock.skel.h"
|
||||
#include "res_spin_lock_fail.skel.h"
|
||||
|
||||
void test_res_spin_lock_failure(void)
|
||||
{
|
||||
RUN_TESTS(res_spin_lock_fail);
|
||||
}
|
||||
|
||||
static volatile int skip;
|
||||
|
||||
static void *spin_lock_thread(void *arg)
|
||||
{
|
||||
int err, prog_fd = *(u32 *) arg;
|
||||
LIBBPF_OPTS(bpf_test_run_opts, topts,
|
||||
.data_in = &pkt_v4,
|
||||
.data_size_in = sizeof(pkt_v4),
|
||||
.repeat = 10000,
|
||||
);
|
||||
|
||||
while (!READ_ONCE(skip)) {
|
||||
err = bpf_prog_test_run_opts(prog_fd, &topts);
|
||||
ASSERT_OK(err, "test_run");
|
||||
ASSERT_OK(topts.retval, "test_run retval");
|
||||
}
|
||||
pthread_exit(arg);
|
||||
}
|
||||
|
||||
void test_res_spin_lock_success(void)
|
||||
{
|
||||
LIBBPF_OPTS(bpf_test_run_opts, topts,
|
||||
.data_in = &pkt_v4,
|
||||
.data_size_in = sizeof(pkt_v4),
|
||||
.repeat = 1,
|
||||
);
|
||||
struct res_spin_lock *skel;
|
||||
pthread_t thread_id[16];
|
||||
int prog_fd, i, err;
|
||||
void *ret;
|
||||
|
||||
if (get_nprocs() < 2) {
|
||||
test__skip();
|
||||
return;
|
||||
}
|
||||
|
||||
skel = res_spin_lock__open_and_load();
|
||||
if (!ASSERT_OK_PTR(skel, "res_spin_lock__open_and_load"))
|
||||
return;
|
||||
/* AA deadlock */
|
||||
prog_fd = bpf_program__fd(skel->progs.res_spin_lock_test);
|
||||
err = bpf_prog_test_run_opts(prog_fd, &topts);
|
||||
ASSERT_OK(err, "error");
|
||||
ASSERT_OK(topts.retval, "retval");
|
||||
|
||||
prog_fd = bpf_program__fd(skel->progs.res_spin_lock_test_held_lock_max);
|
||||
err = bpf_prog_test_run_opts(prog_fd, &topts);
|
||||
ASSERT_OK(err, "error");
|
||||
ASSERT_OK(topts.retval, "retval");
|
||||
|
||||
/* Multi-threaded ABBA deadlock. */
|
||||
|
||||
prog_fd = bpf_program__fd(skel->progs.res_spin_lock_test_AB);
|
||||
for (i = 0; i < 16; i++) {
|
||||
int err;
|
||||
|
||||
err = pthread_create(&thread_id[i], NULL, &spin_lock_thread, &prog_fd);
|
||||
if (!ASSERT_OK(err, "pthread_create"))
|
||||
goto end;
|
||||
}
|
||||
|
||||
topts.retval = 0;
|
||||
topts.repeat = 1000;
|
||||
int fd = bpf_program__fd(skel->progs.res_spin_lock_test_BA);
|
||||
while (!topts.retval && !err && !READ_ONCE(skel->bss->err)) {
|
||||
err = bpf_prog_test_run_opts(fd, &topts);
|
||||
}
|
||||
|
||||
WRITE_ONCE(skip, true);
|
||||
|
||||
for (i = 0; i < 16; i++) {
|
||||
if (!ASSERT_OK(pthread_join(thread_id[i], &ret), "pthread_join"))
|
||||
goto end;
|
||||
if (!ASSERT_EQ(ret, &prog_fd, "ret == prog_fd"))
|
||||
goto end;
|
||||
}
|
||||
|
||||
ASSERT_EQ(READ_ONCE(skel->bss->err), -EDEADLK, "timeout err");
|
||||
ASSERT_OK(err, "err");
|
||||
ASSERT_EQ(topts.retval, -EDEADLK, "timeout");
|
||||
end:
|
||||
res_spin_lock__destroy(skel);
|
||||
return;
|
||||
}
|
|
@ -11,6 +11,9 @@ extern void bpf_local_irq_save(unsigned long *) __weak __ksym;
|
|||
extern void bpf_local_irq_restore(unsigned long *) __weak __ksym;
|
||||
extern int bpf_copy_from_user_str(void *dst, u32 dst__sz, const void *unsafe_ptr__ign, u64 flags) __weak __ksym;
|
||||
|
||||
struct bpf_res_spin_lock lockA __hidden SEC(".data.A");
|
||||
struct bpf_res_spin_lock lockB __hidden SEC(".data.B");
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("arg#0 doesn't point to an irq flag on stack")
|
||||
int irq_save_bad_arg(struct __sk_buff *ctx)
|
||||
|
@ -510,4 +513,54 @@ int irq_sleepable_global_subprog_indirect(void *ctx)
|
|||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("cannot restore irq state out of order")
|
||||
int irq_ooo_lock_cond_inv(struct __sk_buff *ctx)
|
||||
{
|
||||
unsigned long flags1, flags2;
|
||||
|
||||
if (bpf_res_spin_lock_irqsave(&lockA, &flags1))
|
||||
return 0;
|
||||
if (bpf_res_spin_lock_irqsave(&lockB, &flags2)) {
|
||||
bpf_res_spin_unlock_irqrestore(&lockA, &flags1);
|
||||
return 0;
|
||||
}
|
||||
|
||||
bpf_res_spin_unlock_irqrestore(&lockB, &flags1);
|
||||
bpf_res_spin_unlock_irqrestore(&lockA, &flags2);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("function calls are not allowed")
|
||||
int irq_wrong_kfunc_class_1(struct __sk_buff *ctx)
|
||||
{
|
||||
unsigned long flags1;
|
||||
|
||||
if (bpf_res_spin_lock_irqsave(&lockA, &flags1))
|
||||
return 0;
|
||||
/* For now, bpf_local_irq_restore is not allowed in critical section,
|
||||
* but this test ensures error will be caught with kfunc_class when it's
|
||||
* opened up. Tested by temporarily permitting this kfunc in critical
|
||||
* section.
|
||||
*/
|
||||
bpf_local_irq_restore(&flags1);
|
||||
bpf_res_spin_unlock_irqrestore(&lockA, &flags1);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("function calls are not allowed")
|
||||
int irq_wrong_kfunc_class_2(struct __sk_buff *ctx)
|
||||
{
|
||||
unsigned long flags1, flags2;
|
||||
|
||||
bpf_local_irq_save(&flags1);
|
||||
if (bpf_res_spin_lock_irqsave(&lockA, &flags2))
|
||||
return 0;
|
||||
bpf_local_irq_restore(&flags2);
|
||||
bpf_res_spin_unlock_irqrestore(&lockA, &flags1);
|
||||
return 0;
|
||||
}
|
||||
|
||||
char _license[] SEC("license") = "GPL";
|
||||
|
|
tools/testing/selftests/bpf/progs/res_spin_lock.c (new file, 143 lines)
|
@ -0,0 +1,143 @@
|
|||
// SPDX-License-Identifier: GPL-2.0
|
||||
/* Copyright (c) 2024-2025 Meta Platforms, Inc. and affiliates. */
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include "bpf_misc.h"
|
||||
|
||||
#define EDEADLK 35
|
||||
#define ETIMEDOUT 110
|
||||
|
||||
struct arr_elem {
|
||||
struct bpf_res_spin_lock lock;
|
||||
};
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_ARRAY);
|
||||
__uint(max_entries, 64);
|
||||
__type(key, int);
|
||||
__type(value, struct arr_elem);
|
||||
} arrmap SEC(".maps");
|
||||
|
||||
struct bpf_res_spin_lock lockA __hidden SEC(".data.A");
|
||||
struct bpf_res_spin_lock lockB __hidden SEC(".data.B");
|
||||
|
||||
SEC("tc")
|
||||
int res_spin_lock_test(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem1, *elem2;
|
||||
int r;
|
||||
|
||||
elem1 = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem1)
|
||||
return -1;
|
||||
elem2 = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem2)
|
||||
return -1;
|
||||
|
||||
r = bpf_res_spin_lock(&elem1->lock);
|
||||
if (r)
|
||||
return r;
|
||||
if (!bpf_res_spin_lock(&elem2->lock)) {
|
||||
bpf_res_spin_unlock(&elem2->lock);
|
||||
bpf_res_spin_unlock(&elem1->lock);
|
||||
return -1;
|
||||
}
|
||||
bpf_res_spin_unlock(&elem1->lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("tc")
|
||||
int res_spin_lock_test_AB(struct __sk_buff *ctx)
|
||||
{
|
||||
int r;
|
||||
|
||||
r = bpf_res_spin_lock(&lockA);
|
||||
if (r)
|
||||
return !r;
|
||||
/* Only unlock if we took the lock. */
|
||||
if (!bpf_res_spin_lock(&lockB))
|
||||
bpf_res_spin_unlock(&lockB);
|
||||
bpf_res_spin_unlock(&lockA);
|
||||
return 0;
|
||||
}
|
||||
|
||||
int err;
|
||||
|
||||
SEC("tc")
|
||||
int res_spin_lock_test_BA(struct __sk_buff *ctx)
|
||||
{
|
||||
int r;
|
||||
|
||||
r = bpf_res_spin_lock(&lockB);
|
||||
if (r)
|
||||
return !r;
|
||||
if (!bpf_res_spin_lock(&lockA))
|
||||
bpf_res_spin_unlock(&lockA);
|
||||
else
|
||||
err = -EDEADLK;
|
||||
bpf_res_spin_unlock(&lockB);
|
||||
return err ?: 0;
|
||||
}
|
||||
|
||||
SEC("tc")
|
||||
int res_spin_lock_test_held_lock_max(struct __sk_buff *ctx)
|
||||
{
|
||||
struct bpf_res_spin_lock *locks[48] = {};
|
||||
struct arr_elem *e;
|
||||
u64 time_beg, time;
|
||||
int ret = 0, i;
|
||||
|
||||
_Static_assert(ARRAY_SIZE(((struct rqspinlock_held){}).locks) == 31,
|
||||
"RES_NR_HELD assumed to be 31");
|
||||
|
||||
for (i = 0; i < 34; i++) {
|
||||
int key = i;
|
||||
|
||||
/* We cannot pass in i as it will get spilled/filled by the compiler and
|
||||
* loses bounds in verifier state.
|
||||
*/
|
||||
e = bpf_map_lookup_elem(&arrmap, &key);
|
||||
if (!e)
|
||||
return 1;
|
||||
locks[i] = &e->lock;
|
||||
}
|
||||
|
||||
for (; i < 48; i++) {
|
||||
int key = i - 2;
|
||||
|
||||
/* We cannot pass in i as it will get spilled/filled by the compiler and
|
||||
* loses bounds in verifier state.
|
||||
*/
|
||||
e = bpf_map_lookup_elem(&arrmap, &key);
|
||||
if (!e)
|
||||
return 1;
|
||||
locks[i] = &e->lock;
|
||||
}
|
||||
|
||||
time_beg = bpf_ktime_get_ns();
|
||||
for (i = 0; i < 34; i++) {
|
||||
if (bpf_res_spin_lock(locks[i]))
|
||||
goto end;
|
||||
}
|
||||
|
||||
/* Trigger AA, after exhausting entries in the held lock table. This
|
||||
* time, only the timeout can save us, as AA detection won't succeed.
|
||||
*/
|
||||
if (!bpf_res_spin_lock(locks[34])) {
|
||||
bpf_res_spin_unlock(locks[34]);
|
||||
ret = 1;
|
||||
goto end;
|
||||
}
|
||||
|
||||
end:
|
||||
for (i = i - 1; i >= 0; i--)
|
||||
bpf_res_spin_unlock(locks[i]);
|
||||
time = bpf_ktime_get_ns() - time_beg;
|
||||
/* Time spent should be easily above our limit (1/4 s), since AA
|
||||
* detection won't be expedited due to lack of held lock entry.
|
||||
*/
|
||||
return ret ?: (time > 1000000000 / 4 ? 0 : 1);
|
||||
}
|
||||
|
||||
char _license[] SEC("license") = "GPL";
|
tools/testing/selftests/bpf/progs/res_spin_lock_fail.c (new file, 244 lines)
|
@ -0,0 +1,244 @@
|
|||
// SPDX-License-Identifier: GPL-2.0
|
||||
/* Copyright (c) 2024-2025 Meta Platforms, Inc. and affiliates. */
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include "bpf_misc.h"
|
||||
#include "bpf_experimental.h"
|
||||
|
||||
struct arr_elem {
|
||||
struct bpf_res_spin_lock lock;
|
||||
};
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_ARRAY);
|
||||
__uint(max_entries, 1);
|
||||
__type(key, int);
|
||||
__type(value, struct arr_elem);
|
||||
} arrmap SEC(".maps");
|
||||
|
||||
long value;
|
||||
|
||||
struct bpf_spin_lock lock __hidden SEC(".data.A");
|
||||
struct bpf_res_spin_lock res_lock __hidden SEC(".data.B");
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("point to map value or allocated object")
|
||||
int res_spin_lock_arg(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
bpf_res_spin_lock((struct bpf_res_spin_lock *)bpf_core_cast(&elem->lock, struct __sk_buff));
|
||||
bpf_res_spin_lock(&elem->lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("AA deadlock detected")
|
||||
int res_spin_lock_AA(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
bpf_res_spin_lock(&elem->lock);
|
||||
bpf_res_spin_lock(&elem->lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("AA deadlock detected")
|
||||
int res_spin_lock_cond_AA(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
if (bpf_res_spin_lock(&elem->lock))
|
||||
return 0;
|
||||
bpf_res_spin_lock(&elem->lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("unlock of different lock")
|
||||
int res_spin_lock_mismatch_1(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
if (bpf_res_spin_lock(&elem->lock))
|
||||
return 0;
|
||||
bpf_res_spin_unlock(&res_lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("unlock of different lock")
|
||||
int res_spin_lock_mismatch_2(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
if (bpf_res_spin_lock(&res_lock))
|
||||
return 0;
|
||||
bpf_res_spin_unlock(&elem->lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("unlock of different lock")
|
||||
int res_spin_lock_irq_mismatch_1(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
unsigned long f1;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
bpf_local_irq_save(&f1);
|
||||
if (bpf_res_spin_lock(&res_lock))
|
||||
return 0;
|
||||
bpf_res_spin_unlock_irqrestore(&res_lock, &f1);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("unlock of different lock")
|
||||
int res_spin_lock_irq_mismatch_2(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
unsigned long f1;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
if (bpf_res_spin_lock_irqsave(&res_lock, &f1))
|
||||
return 0;
|
||||
bpf_res_spin_unlock(&res_lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__success
|
||||
int res_spin_lock_ooo(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
if (bpf_res_spin_lock(&res_lock))
|
||||
return 0;
|
||||
if (bpf_res_spin_lock(&elem->lock)) {
|
||||
bpf_res_spin_unlock(&res_lock);
|
||||
return 0;
|
||||
}
|
||||
bpf_res_spin_unlock(&elem->lock);
|
||||
bpf_res_spin_unlock(&res_lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__success
|
||||
int res_spin_lock_ooo_irq(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
unsigned long f1, f2;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
if (bpf_res_spin_lock_irqsave(&res_lock, &f1))
|
||||
return 0;
|
||||
if (bpf_res_spin_lock_irqsave(&elem->lock, &f2)) {
|
||||
bpf_res_spin_unlock_irqrestore(&res_lock, &f1);
|
||||
/* We won't have a unreleased IRQ flag error here. */
|
||||
return 0;
|
||||
}
|
||||
bpf_res_spin_unlock_irqrestore(&elem->lock, &f2);
|
||||
bpf_res_spin_unlock_irqrestore(&res_lock, &f1);
|
||||
return 0;
|
||||
}
|
||||
|
||||
struct bpf_res_spin_lock lock1 __hidden SEC(".data.OO1");
|
||||
struct bpf_res_spin_lock lock2 __hidden SEC(".data.OO2");
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("bpf_res_spin_unlock cannot be out of order")
|
||||
int res_spin_lock_ooo_unlock(struct __sk_buff *ctx)
|
||||
{
|
||||
if (bpf_res_spin_lock(&lock1))
|
||||
return 0;
|
||||
if (bpf_res_spin_lock(&lock2)) {
|
||||
bpf_res_spin_unlock(&lock1);
|
||||
return 0;
|
||||
}
|
||||
bpf_res_spin_unlock(&lock1);
|
||||
bpf_res_spin_unlock(&lock2);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("off 1 doesn't point to 'struct bpf_res_spin_lock' that is at 0")
|
||||
int res_spin_lock_bad_off(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem)
|
||||
return 0;
|
||||
bpf_res_spin_lock((void *)&elem->lock + 1);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("R1 doesn't have constant offset. bpf_res_spin_lock has to be at the constant offset")
|
||||
int res_spin_lock_var_off(struct __sk_buff *ctx)
|
||||
{
|
||||
struct arr_elem *elem;
|
||||
u64 val = value;
|
||||
|
||||
elem = bpf_map_lookup_elem(&arrmap, &(int){0});
|
||||
if (!elem) {
|
||||
// FIXME: Only inline assembly use in assert macro doesn't emit
|
||||
// BTF definition.
|
||||
bpf_throw(0);
|
||||
return 0;
|
||||
}
|
||||
bpf_assert_range(val, 0, 40);
|
||||
bpf_res_spin_lock((void *)&value + val);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("map 'res_spin.bss' has no valid bpf_res_spin_lock")
|
||||
int res_spin_lock_no_lock_map(struct __sk_buff *ctx)
|
||||
{
|
||||
bpf_res_spin_lock((void *)&value + 1);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("?tc")
|
||||
__failure __msg("local 'kptr' has no valid bpf_res_spin_lock")
|
||||
int res_spin_lock_no_lock_kptr(struct __sk_buff *ctx)
|
||||
{
|
||||
struct { int i; } *p = bpf_obj_new(typeof(*p));
|
||||
|
||||
if (!p)
|
||||
return 0;
|
||||
bpf_res_spin_lock((void *)p);
|
||||
return 0;
|
||||
}
|
||||
|
||||
char _license[] SEC("license") = "GPL";
|