
perf(vanilla-io_uring): converge onto the epoll twin — zero-alloc rendering, /pipeline skip-decode, crud slab (impl vanilla#84, #85)#965

Open
enghitalo wants to merge 1 commit into MDA2AV:main from enghitalo:perf/vanilla-io_uring-converge-epoll


@enghitalo

Contributor

What & why

vanilla-io_uring was an under-optimized copy of its vanilla-epoll twin: same handlers, but allocating throwaway strings per request, using json.encode/json.decode reflection on the DB paths, and parsing the full request even for the fixed /pipeline blit. This PR ports every backend-agnostic optimization from the epoll entry so the two share one audited set of response builders and diff cleanly — the stated goal of keeping the two entries easy to maintain in lock-step.

The io_uring backend of the vanilla library supports only a stateless request_handler (no async_handler / make_state / TLS — see enghitalo/vanilla#83, enghitalo/vanilla#93), so DB access necessarily stays on the blocking db.pg client. Everything that does not require the async runtime or per-worker state now matches epoll byte-for-byte.

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and enghitalo/vanilla#85 (crud: 1-query list, byte-rendered GET, fast crud-body parse).

Changes (all entry-only, no lib change)

  • wi is now negative-aware — fixes a latent wrong body for a negative /baseline11 sum (a=-10&b=3 returned garbage; now -7).
  • emit / emit_int (stack scratch) / emit_xcache — zero-alloc response framing. /baseline11 and /upload no longer allocate an int -> string per request (the highest-RPS non-DB profiles).
  • /pipeline skip-decode fast path — blit the constant before any parsing; decode_into (no !HttpRequest Result boxing) on the main parse path.
  • render_item_pg — byte-level JSON straight from db.pg text rows, removing the per-request json.encode reflection on /async-db, /crud list and /crud GET.
  • crud cache is now an id-indexed slab (replaces map[int]string) with in-place buffer reuse and cache-aside invalidation, shared across ring workers under RwMutex — identical structure to the epoll twin.
  • crud_list uses a single windowed query (count(*) OVER()) instead of a page SELECT + a separate count(*).
  • parse_crud_body_fast + borrowed JSON field parsers (with the json.decode fallback kept for escaped bodies).
  • parse_i64_slice / dechunk_into / parse_hex_slice — allocation-free query/body parsing (qint no longer materializes a string per param).
  • static: sendfile_min_bytes = 16 KiB, matching epoll (bounds per-connection RSS at high conn counts).

Net: the non-DB hot paths (pipeline / baseline / upload / json / json-comp / static) are now zero-alloc under the default GC, which should also pull io_uring's steady-state memory down toward epoll's (baseline was ~1.5 GiB vs epoll's ~78 MiB, dominated by the per-request sum.str() / json.encode churn this PR removes).

What is intentionally NOT changed

DB profiles (fortunes, async-db, api-*, crud) stay capped by the blocking db.pg on the single ring worker — the io_uring backend has no async runtime to await DB readiness on the ring. Tracked in enghitalo/vanilla#83 (async runtime) and enghitalo/vanilla#93 (per-worker state). Those are the only remaining divergences from the epoll twin.

Validation

  • Both images build (v -prod -d vanilla_tls).
  • Ran both containers against a pristine seeded Postgres (fresh DB per framework) and diffed every route: pipeline, baseline11 (positive and negative), upload, json, json-comp, async-db, fortunes, static (br negotiation), crud list, crud GET (MISS then HIT), crud create, crud update, 404, and json-tls. All 17 are byte-for-byte identical to vanilla-epoll.
  • X-Cache sequence verified: GET MISS → HIT, then re-MISS after a PUT (slab invalidation). POST → 201. json-comp → Content-Encoding: gzip. json-tls → 200 over TLS 1.3.

🤖 Generated with Claude Code

…ering, /pipeline skip-decode, crud slab)

Port every backend-agnostic optimization from the vanilla-epoll entry so the two
share one audited set of response builders and diff cleanly. The io_uring backend
supports only a stateless request_handler (no async_handler / make_state / TLS,
enghitalo/vanilla#83), so DB access stays on the blocking db.pg client; everything
else now matches epoll byte-for-byte.

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and
enghitalo/vanilla#85 (crud: 1-query list, byte-rendered GET, fast body parse):

- wi: negative-aware (fixes a latent wrong body for a negative /baseline11 sum)
- emit / emit_int (stack scratch) / emit_xcache: zero-alloc response framing;
  /baseline11 and /upload no longer allocate an int->string per request
- /pipeline: skip-decode fast path (blit the const before parsing) + decode_into
  (no Result boxing) on the main parse path
- render_item_pg: byte-level JSON straight from db.pg text rows — removes the
  per-request json.encode reflection on /async-db, /crud list, /crud GET
- crud cache: id-indexed slab (replaces map[int]string) with in-place buffer reuse
  and cache-aside invalidation, shared across ring workers under RwMutex
- crud_list: single windowed query (count(*) OVER()) instead of page + count(*)
- parse_crud_body_fast + borrowed json field parsers (json.decode fallback kept)
- parse_i64_slice / dechunk_into / parse_hex_slice: allocation-free parsing
- static: sendfile_min_bytes=16KiB, matching epoll (bounds per-conn RSS at high conns)

DB profiles remain capped by the blocking db.pg on the single ring worker
(enghitalo/vanilla#83) — unchanged here by design.

Validated: both images build; every route (pipeline, baseline +/-, upload, json,
json-comp, async-db, fortunes, static, crud list/get/create/update, 404, json-tls)
is byte-for-byte identical to vanilla-epoll against a pristine seeded Postgres, and
the X-Cache MISS->HIT->re-MISS-after-PUT sequence holds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@enghitalo

Contributor Author

/benchmark -f vanilla-io_uring

@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

👋 /benchmark request received. A collaborator will review and approve the run.

@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

Benchmark Results

Framework: vanilla-io_uring | Test: all tests

| Test | Conn | RPS | CPU | Mem | Δ RPS | Δ Mem |
|---|---|---|---|---|---|---|
| baseline | 512 | 2,763,760 | 5774.5% | 1.5GiB | -15.6% | ~0% |
| baseline | 4096 | 2,484,132 | 5594.6% | 1.6GiB | -32.0% | +6.7% |
| pipelined | 512 | 40,666,368 | 6646.7% | 1012MiB | +13.6% | +2.5% |
| pipelined | 4096 | 43,192,371 | 6490.9% | 1.2GiB | +13.9% | +9.1% |
| limited-conn | 512 | 2,021,944 | 5218.2% | 1.5GiB | -11.7% | ~0% |
| limited-conn | 4096 | 2,024,149 | 4877.0% | 1.6GiB | -7.3% | +6.7% |
| json | 4096 | 2,417,941 | 6389.8% | 1.4GiB | ~0% | ~0% |
| json-comp | 512 | 2,127,276 | 6010.1% | 1.0GiB | +2.2% | ~0% |
| json-comp | 4096 | 2,869,583 | 6332.5% | 1.4GiB | +0.9% | ~0% |
| json-comp | 16384 | 233,525 | 663.8% | 2.0GiB | +49.7% | ~0% |
| json-tls | 4096 | 1,491,199 | 6078.6% | 1.3GiB | +0.6% | -13.3% |
| upload | 32 | 2,606 | 1875.5% | 1.4GiB | +0.9% | ~0% |
| upload | 256 | 2,931 | 3429.8% | 1.4GiB | -1.6% | +7.7% |
| api-4 | 256 | 31,219 | 387.2% | 4.5GiB | +9.8% | +125.0% |
| api-16 | 1024 | 15,077 | 1648.1% | 3.1GiB | -47.0% | +47.6% |
| static | 1024 | 194,200 | 774.2% | 1.7GiB | -86.6% | +70.7% |
| static | 4096 | 189,049 | 777.0% | 2.1GiB | -86.0% | +75.0% |
| static | 6800 | 13,186 | 179.4% | 1.8GiB | -98.9% | +28.6% |
| async-db | 1024 | 6,000 | 5756.0% | 1.6GiB | -45.6% | -11.1% |
| crud | 4096 | 189,411 | 886.6% | 1.7GiB | -18.0% | -10.5% |
| fortunes | 1024 | 283 | 5295.6% | 1.6GiB | +664.9% | -5.9% |

Full log
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   72.96ms   22.40ms   49.00ms   175.20ms    5.00s

  841828 requests in 15.00s, 841636 responses
  Throughput: 56.10K req/s
  Bandwidth:  17.13MB/s
  Status codes: 2xx=841636, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 841636 / 841636 responses (100.0%)
  Latency overflow (>5s): 4096
  Reconnects: 2142
  Per-template: 41733,43304,43485,42115,42439,41914,44327,44283,43064,44106,43977,44311,41714,43248,43562,42474,34833,35642,40586,40519
  Per-template-ok: 41733,43304,43485,42115,42439,41914,44327,44283,43064,44106,43977,44311,41714,43248,43562,42474,34833,35642,40586,40519
[info] CPU 358.1% | Mem 1.5GiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   22.58ms   15.90ms   46.70ms   180.30ms   233.40ms

  2717704 requests in 15.00s, 2716232 responses
  Throughput: 181.05K req/s
  Bandwidth:  57.27MB/s
  Status codes: 2xx=2716232, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 2716229 / 2716232 responses (100.0%)
  Reconnects: 11724
  Per-template: 126725,129483,135423,138824,140253,143131,140284,141888,142300,143341,139713,139025,139279,141796,142016,141865,132078,118132,118509,122164
  Per-template-ok: 126725,129483,135423,138824,140253,143131,140284,141888,142300,143341,139713,139025,139279,141796,142016,141865,132078,118132,118509,122164
[info] CPU 814.6% | Mem 1.6GiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   21.58ms   16.50ms   41.20ms   178.50ms   218.20ms

  2841174 requests in 15.00s, 2841174 responses
  Throughput: 189.38K req/s
  Bandwidth:  60.08MB/s
  Status codes: 2xx=2841174, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 2841172 / 2841174 responses (100.0%)
  Reconnects: 12323
  Per-template: 137378,139654,141707,144630,145374,146506,143105,144537,145847,147101,147102,143961,143359,142999,146893,150719,142270,128292,127755,131983
  Per-template-ok: 137378,139654,141707,144630,145374,146506,143105,144537,145847,147101,147102,143961,143359,142999,146893,150719,142270,128292,127755,131983
[info] CPU 886.6% | Mem 1.7GiB

=== Best: 189411 req/s (CPU: 886.6%, Mem: 1.7GiB) ===
[info] input BW: 16.26MB/s (avg template: 90 bytes)
[info] saved results/crud/4096/vanilla-io_uring.json
httparena-bench-vanilla-io_uring
httparena-bench-vanilla-io_uring

==============================================
=== vanilla-io_uring / fortunes / 1024c (tool=gcannon) ===
==============================================
[info] resetting postgres for a clean per-profile baseline
[info] starting postgres sidecar
httparena-postgres
[info] postgres ready (seeded)
[info] waiting for server...
[info] server ready

[run 1/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency      0us      0us      0us      0us      0us

  0 requests in 5.00s, 0 responses
  Throughput: 0 req/s
  Bandwidth:  0B/s
  Status codes: 2xx=0, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 0 / 0 responses (0.0%)
[info] CPU 118.2% | Mem 593MiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    4.86s    4.89s    4.97s    5.00s    5.00s

  21 requests in 5.00s, 21 responses
  Throughput: 4 req/s
  Bandwidth:  101.91KB/s
  Status codes: 2xx=21, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 21 / 21 responses (100.0%)
  Latency overflow (>5s): 2
[info] CPU 2923.4% | Mem 1.3GiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    2.51s    2.39s    4.12s    4.83s    4.89s

  1419 requests in 5.00s, 1419 responses
  Throughput: 283 req/s
  Bandwidth:  6.72MB/s
  Status codes: 2xx=1419, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 1419 / 1419 responses (100.0%)
[info] CPU 5295.6% | Mem 1.6GiB

=== Best: 283 req/s (CPU: 5295.6%, Mem: 1.6GiB) ===
[info] saved results/fortunes/1024/vanilla-io_uring.json
httparena-bench-vanilla-io_uring
httparena-bench-vanilla-io_uring
[info] skip: vanilla-io_uring does not subscribe to baseline-h2
[info] skip: vanilla-io_uring does not subscribe to static-h2
[info] skip: vanilla-io_uring does not subscribe to baseline-h2c
[info] skip: vanilla-io_uring does not subscribe to json-h2c
[info] skip: vanilla-io_uring does not subscribe to baseline-h3
[info] skip: vanilla-io_uring does not subscribe to static-h3
[info] skip: vanilla-io_uring does not subscribe to gateway-64
[info] skip: vanilla-io_uring does not subscribe to gateway-h3
[info] skip: vanilla-io_uring does not subscribe to production-stack
[info] skip: vanilla-io_uring does not subscribe to unary-grpc
[info] skip: vanilla-io_uring does not subscribe to unary-grpc-tls
[info] skip: vanilla-io_uring does not subscribe to stream-grpc
[info] skip: vanilla-io_uring does not subscribe to stream-grpc-tls
[info] skip: vanilla-io_uring does not subscribe to echo-ws
[info] skip: vanilla-io_uring does not subscribe to echo-ws-pipeline
[info] rebuilding site/data/*.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/frameworks.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/api-16-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/api-4-256.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/async-db-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/crud-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/fortunes-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-16384.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-tls-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-6800.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/upload-256.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/upload-32.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/current.json
[info] done
httparena-postgres
httparena-redis
[info] restoring loopback MTU to 65536
