
gh-144586: Improve _Py_yield to improve light weight cpu instruction #144587

Open
corona10 wants to merge 5 commits into python:main from corona10:gh-144586
Open

gh-144586: Improve _Py_yield to improve light weight cpu instruction#144587
corona10 wants to merge 5 commits into
python:mainfrom
corona10:gh-144586

Conversation

@corona10 (Member) commented Feb 8, 2026

Comment thread on Python/lock.c (outdated)
@corona10 (Member, Author) commented Feb 8, 2026

Benchmark on my Mac mini (consistently improved)

baseline

Python: 3.15.0a5+ free-threading build (heads/gh-115697:e682141c495, Feb  8 2026, 16:38:00) [Clang 17.0.0 (clang-1700.6.3.2)]
GIL enabled: False
CPUs: 12

2 threads: 0.0284s  (14,103,668 ops/sec)
4 threads: 0.0728s  (10,988,690 ops/sec)
8 threads: 0.3063s  (5,223,362 ops/sec)

with PR

➜  cpython git:(gh-144586) ✗ ./python.exe bench_mutex_contention.py
Python: 3.15.0a5+ free-threading build (heads/gh-144586:21bd43c7e5e, Feb  8 2026, 16:34:31) [Clang 17.0.0 (clang-1700.6.3.2)]
GIL enabled: False
CPUs: 12

2 threads: 0.0239s  (16,738,824 ops/sec)
4 threads: 0.0559s  (14,300,174 ops/sec)
8 threads: 0.1813s  (8,824,965 ops/sec)

script

import threading
import time
import sys
import os

NUM_THREADS_LIST = [2, 4, 8]
OPS_PER_THREAD = 200_000
ROUNDS = 3


def contention_bench(num_threads, ops):
    """Time `num_threads` threads each doing `ops` lock-protected increments."""
    lock = threading.Lock()
    total = [0]
    # +1 so the main thread can release all workers at once before timing starts.
    barrier = threading.Barrier(num_threads + 1)

    def worker():
        barrier.wait()
        for _ in range(ops):
            with lock:
                total[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    barrier.wait()
    t0 = time.perf_counter()
    for t in threads:
        t.join()
    return time.perf_counter() - t0, total[0]


if __name__ == "__main__":
    print(f"Python: {sys.version}")
    if hasattr(sys, "_is_gil_enabled"):
        print(f"GIL enabled: {sys._is_gil_enabled()}")
    print(f"CPUs: {os.cpu_count()}\n")

    for nt in NUM_THREADS_LIST:
        best = float("inf")
        for _ in range(ROUNDS):
            elapsed, total = contention_bench(nt, OPS_PER_THREAD)
            best = min(best, elapsed)
        print(f"{nt} threads: {best:.4f}s  ({total/best:,.0f} ops/sec)")

@corona10 corona10 removed the skip news label Feb 8, 2026
@corona10 corona10 requested a review from vstinner February 8, 2026 07:43
extern void _Py_yield(void);
// Lightweight CPU pause hint for spin-wait loops (e.g., x86 PAUSE, AArch64 WFE).
// Falls back to sched_yield() on platforms without a known pause instruction.
static inline void
@corona10 (Member, Author):
I made it static inline because the function call overhead is more expensive than a single instruction.
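For context, here is a minimal sketch of what such a header-based, static inline pause hint might look like. The names (`_cpu_pause`, `spin_demo`) are illustrative, not CPython's actual API; the intrinsics and instructions shown are the standard ones for each platform.

```c
#if defined(_MSC_VER)
#include <intrin.h>   /* _mm_pause / __yield on MSVC */
#endif

/* Lightweight CPU pause hint for spin-wait loops (illustrative sketch). */
static inline void
_cpu_pause(void)
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    _mm_pause();                  /* x86 PAUSE via MSVC intrinsic */
#elif defined(_MSC_VER) && (defined(_M_ARM64) || defined(_M_ARM))
    __yield();                    /* ARM YIELD via MSVC intrinsic */
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("pause");   /* x86 PAUSE */
#elif defined(__aarch64__)
    __asm__ volatile ("yield");   /* AArch64 YIELD hint */
#else
    /* No known pause instruction: compiler barrier only. */
    __asm__ volatile ("" ::: "memory");
#endif
}

/* Tiny demo: issue the pause hint on every spin iteration. */
static int
spin_demo(int iterations)
{
    int spins = 0;
    for (int i = 0; i < iterations; i++) {
        _cpu_pause();
        spins++;
    }
    return spins;
}
```

Because the whole body reduces to a single instruction (or nothing), a `static inline` definition in a header avoids paying a call/return for it, which is the rationale stated above.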

Contributor:
This function is only used in lock.c; why move it to a header?

@corona10 (Member, Author) Feb 8, 2026:
Ah yeah, we can move it back to lock.c

@corona10 (Member, Author):
umm no

_Py_yield();

Contributor:
I see, I was looking at an older checkout of the main branch. Making it static inline looks fine, although I think LTO would have inlined it anyway.

@corona10 (Member, Author):
> I think LTO would have inlined it anyways.

I think the same way, but just following our old convention :)

#elif defined(_M_X64) || defined(_M_IX86)
_mm_pause();
#elif defined(_M_ARM64) || defined(_M_ARM)
__yield();

@corona10 corona10 added performance Performance or resource usage topic-free-threading labels Feb 8, 2026
@colesbury (Contributor) commented Feb 8, 2026 via email

@corona10 (Member, Author) commented Feb 8, 2026

> We also have an unresolved issue where we are only spinning for one iteration, but changing that seems to hurt performance.

Just for sharing: my micro benchmark and the ft-scaling benchmark don't show any negative impact from this change.

@colesbury (Contributor):

I like the idea of using wfe on arm64, but I think this needs a bunch more work:

  • wfe waits until an exclusive monitor is cleared (or some other event occurs). For spinning, we need to pair it with LDXR or LDAXR. The earlier C11 atomics might be generating that, but I don't think we can rely on it.
  • wfe doesn't handle timeouts, and this code path needs to. We could limit it to the timeout < 0 case. There's wfet, but that's only available on ARMv8.7+.

Maybe we can tackle one architecture at a time: arm64 (esp. Apple Silicon), then x86-64, then the others.
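To illustrate the pairing described here, below is a hedged sketch (not CPython code; all names are invented) of WFE gated by a load-exclusive on AArch64. LDAXR sets the exclusive monitor, so a later store to the same address by another core clears the monitor and generates the event that wakes WFE. On other architectures the sketch degrades to a plain spin. Note it has no timeout handling, per the point above.

```c
#include <stdint.h>

/* Load with acquire semantics that also arms the exclusive monitor on
 * AArch64 (LDAXR). Elsewhere, fall back to a plain volatile load. */
static inline uint32_t
load_exclusive_acquire(const uint32_t *addr)
{
#if defined(__aarch64__)
    uint32_t v;
    __asm__ volatile ("ldaxr %w0, [%1]" : "=r"(v) : "r"(addr) : "memory");
    return v;
#else
    return *(volatile const uint32_t *)addr;
#endif
}

/* WFE sleeps until the monitor is cleared or some other event fires.
 * No-op on non-AArch64 targets, so the loop below still terminates. */
static inline void
wait_for_event(void)
{
#if defined(__aarch64__)
    __asm__ volatile ("wfe" ::: "memory");
#endif
}

/* Spin until *addr equals `expected`. No timeout support: plain WFE
 * cannot time out (WFET exists only on ARMv8.7+). */
static void
spin_until(const uint32_t *addr, uint32_t expected)
{
    while (load_exclusive_acquire(addr) != expected) {
        wait_for_event();
    }
}
```

The key detail is that the load-exclusive and the WFE must target the same address: a WFE issued without an armed monitor only wakes on unrelated events, which is why relying on the compiler's C11 atomics to emit LDAXR is fragile.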

@dpdani (Contributor) commented Apr 16, 2026

I have been experimenting with avoiding sched_yield, and I've found that on my MacBook even a single wfe instruction introduces a very long delay, which significantly decreases throughput when a lock is not highly contended, unlike in the program above.

I have had more success in both high- and low-contention scenarios by issuing an exponentially increasing number of yield instructions in _PyMutex_LockTimed. This also removes the timeout problem, since yield does not introduce indefinite waits.

BTW, other VMs also use isb, but that seems to introduce contention that decreases the throughput of highly contended locks when tested with PyMutex, so we probably don't want to go with that either.
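A rough sketch of the exponential-backoff idea described above (illustrative only; `backoff_lock` and the spin cap are invented here, not dpdani's actual patch): each failed acquisition attempt doubles the number of pause/yield hints issued, up to a bound, so waits stay finite and timeouts remain possible.

```c
#include <stdatomic.h>

/* Single pause/yield hint; no-op on unknown architectures. */
static inline void
cpu_relax(void)
{
#if defined(__aarch64__)
    __asm__ volatile ("yield");
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("pause");
#endif
}

#define MAX_SPINS 64   /* cap chosen arbitrarily for illustration */

/* Acquire a simple test-and-set lock with exponential backoff:
 * spin 1, 2, 4, ... hints between attempts, bounded by MAX_SPINS. */
static void
backoff_lock(atomic_flag *lock)
{
    int spins = 1;
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        for (int i = 0; i < spins; i++) {
            cpu_relax();
        }
        if (spins < MAX_SPINS) {
            spins *= 2;
        }
    }
}

static void
backoff_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

Because every wait is a bounded burst of hints rather than a blocking instruction, the waiter periodically re-checks the lock, which is what makes this compatible with timed acquisition paths, unlike plain wfe.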

4 participants