
pyronova: bump to v2.3.1, unlock sub-interp DB bridge + size GIL pool#623

Open
ddxd wants to merge 4 commits into MDA2AV:main from moomoo-tech:pyronova-v2.3.1

Conversation

Contributor

@ddxd ddxd commented Apr 23, 2026

Updates Pyronova from v2.2.0 → v2.3.1 via PYRONOVA_REF + a small launcher tweak. No app.py changes.

What changed in Pyronova between v2.2.0 and v2.3.1

1. Sub-interpreter DB bridge now works under TPC (src/bridge/db_bridge.rs)

The bridge existed in v2.2 but panicked the moment a route without gil=True hit it under TPC mode:

thread 'pyronova-tpc-N' panicked at tokio ... multi_thread/mod.rs:88:
Cannot start a runtime from within a runtime.

Cause: the bridge's C-FFI entry points called rt.block_on(fut) on the dedicated DB runtime, from inside the TPC worker thread's own current_thread tokio runtime. tokio forbids nested block_on.

Fix: rt.spawn(fut) + std::sync::mpsc::sync_channel + rx.recv(). spawn has no runtime-context check — it just queues the task onto the DB runtime's worker pool. The sub-interp worker blocks on the channel with the GIL released (py.detach), so peer sub-interpreters keep running during the query. The parallelism ceiling becomes min(sub_interp_workers, DATABASE_MAX_CONN) instead of the single main-interp thread.
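The same nested-runtime hazard and the same escape hatch exist in Python's asyncio, which makes for a runnable analog of the fix. This is an illustrative sketch, not Pyronova code: the dedicated `db_loop` thread stands in for the DB runtime, `asyncio.run` inside a running loop plays the role of tokio's nested `block_on`, and `run_coroutine_threadsafe(...).result()` is the spawn-then-block-on-a-channel pattern.

```python
import asyncio
import threading

# Dedicated "DB" event loop on its own thread — analogous to Pyronova's
# dedicated DB runtime (all names here are illustrative).
db_loop = asyncio.new_event_loop()
threading.Thread(target=db_loop.run_forever, daemon=True).start()

async def db_query():
    await asyncio.sleep(0)  # stand-in for a real driver call
    return "row"

async def handler():
    # BAD: starting a nested runtime from inside a running loop is
    # forbidden, just like tokio's nested block_on.
    coro = db_query()
    try:
        asyncio.run(coro)
        nested_error = ""
    except RuntimeError as e:
        nested_error = str(e)
        coro.close()  # avoid a "never awaited" warning in the sketch

    # GOOD: schedule the coroutine onto the dedicated DB loop (a queued
    # submission, no runtime-context check) and block the calling thread
    # on the result — the spawn + channel-recv pattern.
    fut = asyncio.run_coroutine_threadsafe(db_query(), db_loop)
    return nested_error, fut.result(timeout=5)

err, row = asyncio.run(handler())
```

Blocking on `fut.result()` stalls only the calling worker, so sibling workers keep serving requests while the query runs on the DB loop's thread — the same reason `py.detach` matters in the real bridge.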

2. Main-interp gil=True bridge defaults to N-worker pool (src/bridge/main_bridge.rs)

Used by the crud routes: their cache-aside dict semantics require a single interpreter — sub-interp workers have independent heaps, so SO_REUSEPORT would route consecutive GET /crud/items/42 hits to different workers and the HttpArena cache-aside validator would never see the MISS→HIT transition.

Previously the bridge was served by a single std::thread; v2.3 replaces it with a crossbeam::bounded MPMC queue served by N threads. This PR's launcher change sets PYRONOVA_GIL_BRIDGE_WORKERS=16 + PYRONOVA_GIL_BRIDGE_CAPACITY=8192 so the 1024–4096-conn profiles don't 503-storm on the 64-deep default channel.
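A toy model of the resized bridge, assuming nothing beyond what the PR states: a bounded queue served by N workers, where a full queue maps to an immediate 503 instead of unbounded blocking. The constants mirror the launcher env vars; the implementation is a sketch, not Pyronova's actual code.

```python
import json
import queue
import threading

GIL_BRIDGE_WORKERS = 4    # mirrors PYRONOVA_GIL_BRIDGE_WORKERS
GIL_BRIDGE_CAPACITY = 8   # mirrors PYRONOVA_GIL_BRIDGE_CAPACITY

# Bounded MPMC job queue served by N worker threads.
jobs = queue.Queue(maxsize=GIL_BRIDGE_CAPACITY)

def worker():
    while True:
        fn, reply = jobs.get()
        reply.put(fn())

for _ in range(GIL_BRIDGE_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

def call_on_bridge(fn):
    """Enqueue fn for a bridge worker; a full queue becomes a 503."""
    reply = queue.Queue(maxsize=1)
    try:
        jobs.put_nowait((fn, reply))
    except queue.Full:
        return 503, None  # overflow: shed load instead of blocking
    return 200, reply.get(timeout=5)

status, body = call_on_bridge(lambda: json.dumps({"ok": True}))
```

With capacity 64 and thousands of in-flight connections, `queue.Full` fires constantly — which is exactly the 503 storm the deeper channel is sized to avoid.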

3. TPC becomes the default dispatch mode

Flipped in 0ae579c upstream. The Arena leaderboard's current v2.2.0 numbers are from hybrid mode (N sub-interp pool + N io threads, with per-request crossbeam_channel dispatch across workers). TPC replaces that with per-thread sub-interpreter + same-thread handler call — zero cross-thread wake, zero cross-CCD atomic contention on the hot path. On the Arena 32-physical-core EPYC, this should lift baseline / short-lived / json proportionally, not just async-db.
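The hybrid-vs-TPC distinction above can be reduced to a minimal sketch (illustrative only — the real engine's hybrid mode uses crossbeam channels across a worker pool, not Python queues): hybrid dispatch crosses a thread boundary on every request, while TPC runs the handler inline on the accepting thread.

```python
import queue
import threading

def handler(req):
    return req.upper()

# Hybrid-style dispatch: every request is handed across a shared channel
# to a worker thread — one cross-thread wake per request.
requests, responses = queue.Queue(), queue.Queue()

def hybrid_worker():
    while True:
        responses.put(handler(requests.get()))

threading.Thread(target=hybrid_worker, daemon=True).start()

def serve_hybrid(req):
    requests.put(req)        # cross-thread handoff on the hot path
    return responses.get()   # second handoff for the response

# TPC-style dispatch: the accepting thread calls the handler directly —
# no queue, no wakeup, no cross-thread cache traffic.
def serve_tpc(req):
    return handler(req)

out_hybrid = serve_hybrid("ping")
out_tpc = serve_tpc("ping")
```

Both paths return the same result; the difference is purely in how many thread boundaries (and, on multi-CCD parts, cache-coherence hops) each request pays for.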

Measured impact (local)

Linux 7840HS 16-thread, Postgres sidecar, wrk -t8:

| Profile | v2.2.0 (hybrid) | v2.3.1 (TPC) | Δ |
| --- | --- | --- | --- |
| /async-db @ c=64 | 3.7k | 30k | ≈8× |
| /async-db @ c=1024 | 3.7k | 35k | ≈9× |
| /async-db @ c=4096 | 3.7k | 34k | ≈9× |

All profiles: 0 non-2xx across c=64..4096.

validate.sh pyronova: 49 passed, 0 failed (verified locally against the current scripts/validate.sh).

Compatibility

  • No app.py changes. The previously-submitted /async-db handler (without gil=True) and crud handlers (with gil=True) both work unchanged with the v2.3.1 engine.
  • Pyronova's gil=True contract is preserved: pydantic-core, numpy, and any other main-interp-only extension still works.
  • The engine is source-cloned from https://github.com/moomoo-tech/pyronova at tag v2.3.1 (existing Dockerfile pattern, just bumping the ref).

🤖 Generated with Claude Code

ddxd added 4 commits April 22, 2026 18:21
Changes:
- PYRONOVA_REF: v2.0.2 -> v2.1.5
- app.py: add warning-level logging on benchmark-path errors; fix CRUD
  endpoint paths/response shape to match aspnet-minimal reference;
  restore gil=True on async-db (PG_POOL lives on main interp); widen
  upload limit to 25MB
- launcher.py: NUMA-aware io_workers cap (avoid oversubscription on
  multi-socket boxes)
- meta.json: subscribe to crud + unary-grpc{,-tls} + api-{4,16}

v2.1.5 highlights (see moomoo-tech/pyronova CHANGELOG):
- Per-worker sharded channels replacing the single MPMC crossbeam queue
  (eliminates cross-CCD cache-line bouncing on AMD multi-CCD boxes)
- TCP_DEFER_ACCEPT, slowloris/HPP hardening, Py_SETREF correctness fix
- In-flight-aware P2C load balancing
- HOL body streaming + bounded WS channel
Arena's validate.sh reads `X-Cache` (MISS/HIT) via
`curl | grep ^x-cache:` under `set -o pipefail`. Without the header
the pipeline fails silently and terminates the whole script before
fail_with_link has a chance to print anything — exactly what we saw on
the last CI run ("PASS [GET /crud/items/1]" → cleanup → exit 1, no
diagnostic in between).

Fix: wrap every crud_get_one return path in Response with the
appropriate X-Cache header (MISS on first fetch / error / 404, HIT on
cache-aside return). Cache now stores a pre-serialized JSON string so
the HIT path skips json.dumps on every hit.

No behavior change for any other profile.
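The commit's two changes — an X-Cache header on every return path and a pre-serialized cache value — can be sketched together. This is a minimal illustration of the pattern described, assuming a hypothetical `fetch_item` stand-in for the DB call; it is not the actual app.py handler.

```python
import json

_cache = {}  # item_id -> pre-serialized JSON string

def fetch_item(item_id):
    # Hypothetical stand-in for the real database fetch.
    return {"id": item_id, "name": f"item-{item_id}"}

def crud_get_one(item_id):
    """Return (body, headers); every path carries an X-Cache header."""
    cached = _cache.get(item_id)
    if cached is not None:
        # HIT path: the body is already a JSON string, so no json.dumps.
        return cached, {"X-Cache": "HIT"}
    body = json.dumps(fetch_item(item_id))
    _cache[item_id] = body  # store the serialized form, not the dict
    return body, {"X-Cache": "MISS"}

_, first = crud_get_one(42)
_, second = crud_get_one(42)
```

Because the header is present on the MISS path too, a `curl | grep ^x-cache:` pipeline under `set -o pipefail` always finds a match, so the validator can observe the MISS→HIT transition instead of dying silently.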
v2.2.0 adds a C-FFI DB bridge (4 functions injected into every
sub-interp's globals) that forwards sqlx calls onto the shared
process-global pool while releasing the calling interp's GIL. This
removes the single-GIL ceiling that was capping /async-db at 3.7k rps
on the previous v2.1.5 run.

/crud/* endpoints keep gil=True for now — their in-process dict cache
relies on main-interp GIL serialization for the MISS→HIT semantics the
validator checks. Moving that cache to a SharedState-backed DashMap
unblocks /crud too, tracked as a v2.3 follow-up.

Expected impact this run (64-core TR 3995WX):
- async-db: 3.7k → 30-50k rps (bridge ceiling ≈ min(cores, PG max_conn))
- api-4 / api-16: partial improvement (CRUD sub-profile still gil=True)
- Other profiles: unchanged

See docs/arena-async-db-and-static.md in pyronova repo for the full
design doc.
Updates Pyronova from v2.2.0 → v2.3.1 via PYRONOVA_REF. Two changes
in the Pyronova engine itself that affect Arena numbers:

1. Sub-interpreter DB bridge now works under TPC
   (src/bridge/db_bridge.rs). The bridge existed in v2.2 but panicked
   under TPC mode with "Cannot start a runtime from within a runtime":
   `rt.block_on(fut)` inside each sub-interp worker's tokio
   current_thread runtime is forbidden by tokio. Fixed by
   channel-dispatching to the DB runtime (`rt.spawn` + `std::sync::mpsc`
   recv) instead of nested block_on. The existing /async-db handler (no
   gil=True) now scales across all sub-interps with independent GILs
   instead of 503-ing on the single-thread main bridge.

2. Main-interp gil=True bridge defaults to N-worker pool
   (src/bridge/main_bridge.rs). Used by crud routes (their
   cache-aside dict semantics require a single interpreter). The
   launcher sets PYRONOVA_GIL_BRIDGE_WORKERS=16 + CAPACITY=8192 so
   the 1024+ concurrency profiles don't overflow the default 64-deep
   channel with a 503 storm. Verified locally at c=4096: 15k req/s
   steady, 0 drops.

Both fixes preserve Pyronova's gil=True contract — pydantic-core /
numpy / any other main-interp-only extension still works unchanged.

Measured locally (Linux 7840HS 16-thread, PG sidecar, wrk -t8):
  /async-db @ c=4096: 3.7k (v2.2.0) → 34k req/s (v2.3.1)   ≈9×
  /async-db @ c=64:   3.7k          → 30k req/s            ≈8×
  All profiles: 0 non-2xx responses across c=64..4096.

TPC also becomes the default dispatch mode in v2.3.x (flipped in
0ae579c upstream). The Arena leaderboard's current v2.2.0 numbers
were hybrid-mode; TPC's per-core pinning + leaked route tables
should give a proportional lift to baseline / short-lived / json
profiles too, not just async-db.

validate.sh pyronova locally: 49 passed, 0 failed.
@MDA2AV
Owner

MDA2AV commented Apr 25, 2026

/benchmark -f pyronova --save

@github-actions
Contributor

👋 /benchmark request received. A collaborator will review and approve the run.

@github-actions
Contributor

⚠️ /benchmark --save aborted: main has diverged and cannot be auto-merged into this branch. Please merge or rebase main manually, push, and re-run /benchmark --save.
