
gh-144586: Improve _Py_yield to improve light weight cpu instruction #144587

Open
corona10 wants to merge 5 commits into python:main from corona10:gh-144586
Open

gh-144586: Improve _Py_yield to improve light weight cpu instruction#144587
corona10 wants to merge 5 commits into
python:mainfrom
corona10:gh-144586

Conversation

@corona10 (Member) commented Feb 8, 2026

Comment thread on Python/lock.c (outdated)
@corona10 (Member, Author) commented Feb 8, 2026

Benchmark on my Mac mini (consistently improved)

baseline

Python: 3.15.0a5+ free-threading build (heads/gh-115697:e682141c495, Feb  8 2026, 16:38:00) [Clang 17.0.0 (clang-1700.6.3.2)]
GIL enabled: False
CPUs: 12

2 threads: 0.0284s  (14,103,668 ops/sec)
4 threads: 0.0728s  (10,988,690 ops/sec)
8 threads: 0.3063s  (5,223,362 ops/sec)

with PR

➜  cpython git:(gh-144586) ✗ ./python.exe bench_mutex_contention.py
Python: 3.15.0a5+ free-threading build (heads/gh-144586:21bd43c7e5e, Feb  8 2026, 16:34:31) [Clang 17.0.0 (clang-1700.6.3.2)]
GIL enabled: False
CPUs: 12

2 threads: 0.0239s  (16,738,824 ops/sec)
4 threads: 0.0559s  (14,300,174 ops/sec)
8 threads: 0.1813s  (8,824,965 ops/sec)

script

import threading
import time
import sys
import os

NUM_THREADS_LIST = [2, 4, 8]
OPS_PER_THREAD = 200_000
ROUNDS = 3


def contention_bench(num_threads, ops):
    """Time `num_threads` threads each doing `ops` lock-protected increments."""
    lock = threading.Lock()
    total = [0]
    # +1 so the main thread can release all workers at once before timing starts.
    barrier = threading.Barrier(num_threads + 1)

    def worker():
        barrier.wait()
        for _ in range(ops):
            with lock:
                total[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    barrier.wait()
    t0 = time.perf_counter()
    for t in threads:
        t.join()
    return time.perf_counter() - t0, total[0]


if __name__ == "__main__":
    print(f"Python: {sys.version}")
    if hasattr(sys, "_is_gil_enabled"):
        print(f"GIL enabled: {sys._is_gil_enabled()}")
    print(f"CPUs: {os.cpu_count()}\n")

    for nt in NUM_THREADS_LIST:
        best = float("inf")
        for _ in range(ROUNDS):
            elapsed, total = contention_bench(nt, OPS_PER_THREAD)
            best = min(best, elapsed)
        print(f"{nt} threads: {best:.4f}s  ({total/best:,.0f} ops/sec)")

@corona10 corona10 removed the skip news label Feb 8, 2026
@corona10 corona10 requested a review from vstinner February 8, 2026 07:43
extern void _Py_yield(void);
// Lightweight CPU pause hint for spin-wait loops (e.g., x86 PAUSE, AArch64 WFE).
// Falls back to sched_yield() on platforms without a known pause instruction.
static inline void
@corona10 (Member, Author):
I made it static inline because the function call overhead is more expensive than a single instruction.
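For context, here is a minimal sketch of what such a header-based, static inline pause hint might look like. The names (`_cpu_pause`, `spin_demo`) are illustrative, not CPython's actual API; the intrinsics and instructions shown are the standard ones for each platform.

```c
#if defined(_MSC_VER)
#include <intrin.h>   /* _mm_pause / __yield on MSVC */
#endif

/* Lightweight CPU pause hint for spin-wait loops (illustrative sketch). */
static inline void
_cpu_pause(void)
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    _mm_pause();                  /* x86 PAUSE via MSVC intrinsic */
#elif defined(_MSC_VER) && (defined(_M_ARM64) || defined(_M_ARM))
    __yield();                    /* ARM YIELD via MSVC intrinsic */
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("pause");   /* x86 PAUSE */
#elif defined(__aarch64__)
    __asm__ volatile ("yield");   /* AArch64 YIELD hint */
#else
    /* No known pause instruction: compiler barrier only. */
    __asm__ volatile ("" ::: "memory");
#endif
}

/* Tiny demo: issue the pause hint on every spin iteration. */
static int
spin_demo(int iterations)
{
    int spins = 0;
    for (int i = 0; i < iterations; i++) {
        _cpu_pause();
        spins++;
    }
    return spins;
}
```

Because the whole body reduces to a single instruction (or nothing), a `static inline` definition in a header avoids paying a call/return for it, which is the rationale stated above.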

Contributor:
This function is only used in lock.c; why move it to a header?

@corona10 (Member, Author) Feb 8, 2026:
Ah yeah, we can move it back to lock.c

@corona10 (Member, Author):
umm no

_Py_yield();

Contributor:
I see, I was looking at an older checkout of the main branch. Making it static inline looks fine, although I think LTO would have inlined it anyway.

@corona10 (Member, Author):
> I think LTO would have inlined it anyways.

I think the same way, but just following our old convention :)

#elif defined(_M_X64) || defined(_M_IX86)
_mm_pause();
#elif defined(_M_ARM64) || defined(_M_ARM)
__yield();

@corona10 corona10 added performance Performance or resource usage topic-free-threading labels Feb 8, 2026
@colesbury (Contributor) commented Feb 8, 2026 via email

@corona10 (Member, Author) commented Feb 8, 2026

> We also have an unresolved issue where we are only spinning for one iteration, but changing that seems to hurt performance.

Just for sharing: my micro benchmark and the ft-scaling benchmark don't show any negative impact from this change.

@colesbury (Contributor):

I like the idea of using wfe on arm64, but I think this needs a bunch more work:

  • wfe waits until an exclusive monitor is cleared (or some other event occurs). For spinning, we need to pair it with LDXR or LDAXR. The earlier C11 atomics might be generating that, but I don't think we can rely on it.
  • wfe doesn't handle timeouts, and this code path needs to. We could limit it to the timeout < 0 case. There's wfet, but that's only available on ARMv8.7+.

Maybe we can tackle one architecture at a time: arm64 (esp. Apple Silicon), then x86-64, then the others.
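To illustrate the pairing described here, below is a hedged sketch (not CPython code; all names are invented) of WFE gated by a load-exclusive on AArch64. LDAXR sets the exclusive monitor, so a later store to the same address by another core clears the monitor and generates the event that wakes WFE. On other architectures the sketch degrades to a plain spin. Note it has no timeout handling, per the point above.

```c
#include <stdint.h>

/* Load with acquire semantics that also arms the exclusive monitor on
 * AArch64 (LDAXR). Elsewhere, fall back to a plain volatile load. */
static inline uint32_t
load_exclusive_acquire(const uint32_t *addr)
{
#if defined(__aarch64__)
    uint32_t v;
    __asm__ volatile ("ldaxr %w0, [%1]" : "=r"(v) : "r"(addr) : "memory");
    return v;
#else
    return *(volatile const uint32_t *)addr;
#endif
}

/* WFE sleeps until the monitor is cleared or some other event fires.
 * No-op on non-AArch64 targets, so the loop below still terminates. */
static inline void
wait_for_event(void)
{
#if defined(__aarch64__)
    __asm__ volatile ("wfe" ::: "memory");
#endif
}

/* Spin until *addr equals `expected`. No timeout support: plain WFE
 * cannot time out (WFET exists only on ARMv8.7+). */
static void
spin_until(const uint32_t *addr, uint32_t expected)
{
    while (load_exclusive_acquire(addr) != expected) {
        wait_for_event();
    }
}
```

The key detail is that the load-exclusive and the WFE must target the same address: a WFE issued without an armed monitor only wakes on unrelated events, which is why relying on the compiler's C11 atomics to emit LDAXR is fragile.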

@dpdani (Contributor) commented Apr 16, 2026

I have been experimenting with avoiding sched_yield, and I've found that on my MacBook even a single wfe instruction introduces a very long delay, which significantly decreases throughput when a lock is not highly contended, unlike in the program above.

I have had more success in both high- and low-contention scenarios by issuing an exponentially increasing number of yield instructions in _PyMutex_LockTimed. This also removes the timeout problem, since yield does not introduce indefinite waits.

BTW, other VMs also use isb, but that seems to introduce contention that decreases the throughput of highly contended locks when tested with PyMutex, so we probably don't want to go with that either.
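A rough sketch of the exponential-backoff idea described above (illustrative only; `backoff_lock` and the spin cap are invented here, not dpdani's actual patch): each failed acquisition attempt doubles the number of pause/yield hints issued, up to a bound, so waits stay finite and timeouts remain possible.

```c
#include <stdatomic.h>

/* Single pause/yield hint; no-op on unknown architectures. */
static inline void
cpu_relax(void)
{
#if defined(__aarch64__)
    __asm__ volatile ("yield");
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("pause");
#endif
}

#define MAX_SPINS 64   /* cap chosen arbitrarily for illustration */

/* Acquire a simple test-and-set lock with exponential backoff:
 * spin 1, 2, 4, ... hints between attempts, bounded by MAX_SPINS. */
static void
backoff_lock(atomic_flag *lock)
{
    int spins = 1;
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        for (int i = 0; i < spins; i++) {
            cpu_relax();
        }
        if (spins < MAX_SPINS) {
            spins *= 2;
        }
    }
}

static void
backoff_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

Because every wait is a bounded burst of hints rather than a blocking instruction, the waiter periodically re-checks the lock, which is what makes this compatible with timed acquisition paths, unlike plain wfe.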

4 participants