perf(vanilla-io_uring): converge onto the epoll twin - zero-alloc rendering, /pipeline skip-decode, crud slab (impl vanilla#84, #85) #965
…ering, /pipeline skip-decode, crud slab)

Port every backend-agnostic optimization from the vanilla-epoll entry so the two share one audited set of response builders and diff cleanly. The io_uring backend supports only a stateless request_handler (no async_handler / make_state / TLS, enghitalo/vanilla#83), so DB access stays on the blocking db.pg client; everything else now matches epoll byte-for-byte.

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and enghitalo/vanilla#85 (crud: 1-query list, byte-rendered GET, fast body parse):

- wi: negative-aware (fixes a latent wrong body for a negative /baseline11 sum)
- emit / emit_int (stack scratch) / emit_xcache: zero-alloc response framing; /baseline11 and /upload no longer allocate an int->string per request
- /pipeline: skip-decode fast path (blit the const before parsing) + decode_into (no Result boxing) on the main parse path
- render_item_pg: byte-level JSON straight from db.pg text rows; removes the per-request json.encode reflection on /async-db, /crud list, /crud GET
- crud cache: id-indexed slab (replaces map[int]string) with in-place buffer reuse and cache-aside invalidation, shared across ring workers under RwMutex
- crud_list: single windowed query (count(*) OVER()) instead of page + count(*)
- parse_crud_body_fast + borrowed json field parsers (json.decode fallback kept)
- parse_i64_slice / dechunk_into / parse_hex_slice: allocation-free parsing
- static: sendfile_min_bytes=16KiB, matching epoll (bounds per-conn RSS at high conns)

DB profiles remain capped by the blocking db.pg on the single ring worker (enghitalo/vanilla#83); unchanged here by design.

Validated: both images build; every route (pipeline, baseline +/-, upload, json, json-comp, async-db, fortunes, static, crud list/get/create/update, 404, json-tls) is byte-for-byte identical to vanilla-epoll against a pristine seeded Postgres, and the X-Cache MISS->HIT->re-MISS-after-PUT sequence holds.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
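For reviewers who want the shape of the "negative-aware, stack scratch" integer formatting named in the commit message, here is a minimal sketch. It is written in Go rather than the entry's own language, and appendInt with its signature is a hypothetical stand-in for wi / emit_int, not the code in this PR. The only behaviour taken from the PR is that a negative /baseline11 sum (a=-10&b=3) must render as -7 without building an intermediate string.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendInt appends the decimal form of n to dst. Digits are written
// right-to-left into a fixed-size scratch array, so no intermediate
// string is created. Hypothetical stand-in for emit_int / wi.
func appendInt(dst []byte, n int64) []byte {
	var scratch [20]byte // fits "-9223372036854775808"
	i := len(scratch)

	// Take the magnitude as uint64 so the minimum int64 does not overflow.
	neg := n < 0
	u := uint64(n)
	if neg {
		u = -u
	}

	for {
		i--
		scratch[i] = byte('0' + u%10)
		u /= 10
		if u == 0 {
			break
		}
	}
	if neg {
		i--
		scratch[i] = '-'
	}
	return append(dst, scratch[i:]...)
}

func main() {
	// In a server this buffer would be reused across requests.
	buf := make([]byte, 0, 64)
	a, b := int64(-10), int64(3)
	buf = appendInt(buf, a+b)
	fmt.Println(string(buf)) // -7

	// Cross-check against the standard library's append-style formatter.
	fmt.Println(string(buf) == string(strconv.AppendInt(nil, a+b, 10))) // true
}
```

The append only grows the destination when it lacks spare capacity, so a response buffer sized once up front keeps the per-request path free of allocations in this sketch.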
/benchmark -f vanilla-io_uring
Benchmark Results
Framework:
Full log
/benchmark -f vanilla-io_uring --save
…file (static -86%..-99%)

The converged entry ported epoll's sendfile_min_bytes=16KiB, but the io_uring backend has NO sendfile path (no core.enable_sendfile / queue_file drain). static_assets then served every representation >= 16KiB (the large .br/.gz siblings) via a per-request disk READ that stalls the single ring worker; CI measured static -86%/-86%/-99% with CPU collapse.

Revert to the default (256KiB): every arena sibling (< 256KiB) is preloaded and served as a zero-copy core.queue_buf borrowed send, restoring the original throughput. The epoll twin keeps sendfile_min_bytes=16KiB because its backend DOES support sendfile.

Verified locally (wrk, 64 conns): /static/vendor.js (-> 67KB .br) serves at ~101k req/s / 6.38 GB/s with zero socket errors (preloaded, bandwidth-bound) instead of the disk-read collapse.

My earlier byte-diff validation missed this: it only exercised a small (preloaded) asset, so it caught correctness but not the large-asset throughput.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
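The threshold effect described in this commit can be modelled in a few lines. The sketch below is in Go and is an assumption-heavy illustration, not the vanilla library's logic: choosePath and the two path names are hypothetical. It encodes only what the commit message states, namely that on a backend without sendfile any representation at or above sendfile_min_bytes falls back to a per-request disk read, while smaller ones are preloaded.

```go
package main

import "fmt"

const KiB = 1024

type servePath int

const (
	preloadedSend  servePath = iota // held in memory, sent as a borrowed buffer
	perRequestRead                  // read from disk on every request
)

func (p servePath) String() string {
	if p == preloadedSend {
		return "preloaded send"
	}
	return "per-request disk read"
}

// choosePath models a backend with no sendfile support: representations at
// or above the threshold are not preloaded and must be read per request.
// Hypothetical helper; the real decision lives inside the library.
func choosePath(sizeBytes, sendfileMinBytes int) servePath {
	if sizeBytes >= sendfileMinBytes {
		return perRequestRead
	}
	return preloadedSend
}

func main() {
	asset := 67 * KiB // roughly the vendor.js .br sibling from the commit message

	fmt.Println("16 KiB threshold: ", choosePath(asset, 16*KiB))
	fmt.Println("256 KiB threshold:", choosePath(asset, 256*KiB))
}
```

With the 16 KiB setting the 67 KB sibling lands on the disk-read path, which is the regression the benchmark surfaced; with the 256 KiB default it stays preloaded.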
Pushed a fix for a static regression the benchmark surfaced (static -86% / -86% / -99%, CPU collapsed): the converged entry had ported epoll's …
Fix (4bb52d0): drop …
Verified locally (wrk, 64 conns): …
Re the other deltas in the run: the DB/api profiles (…
Superseded by #967: reopened clean with the static-config fix folded in (io_uring has no sendfile, so no …
What & why

vanilla-io_uring was an under-optimized copy of its vanilla-epoll twin: same handlers, but allocating throwaway strings per request, using json.encode / json.decode reflection on the DB paths, and parsing the full request even for the fixed /pipeline blit. This PR ports every backend-agnostic optimization from the epoll entry so the two share one audited set of response builders and diff cleanly, which is the stated goal of keeping the two entries easy to maintain in lock-step.

The io_uring backend of the vanilla library supports only a stateless request_handler (no async_handler / make_state / TLS; see enghitalo/vanilla#83, enghitalo/vanilla#93), so DB access necessarily stays on the blocking db.pg client. Everything that does not require the async runtime or per-worker state now matches epoll byte-for-byte.

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and enghitalo/vanilla#85 (crud: 1-query list, byte-rendered GET, fast crud-body parse).
Changes (all entry-only, no lib change)

- wi is now negative-aware: fixes a latent wrong body for a negative /baseline11 sum (a=-10&b=3 returned garbage; now -7).
- emit / emit_int (stack scratch) / emit_xcache: zero-alloc response framing. /baseline11 and /upload no longer allocate an int -> string per request (the highest-RPS non-DB profiles).
- /pipeline skip-decode fast path: blit the constant before any parsing; decode_into (no !HttpRequest Result boxing) on the main parse path.
- render_item_pg: byte-level JSON straight from db.pg text rows, removing the per-request json.encode reflection on /async-db, /crud list and /crud GET.
- crud cache: id-indexed slab (replaces map[int]string) with in-place buffer reuse and cache-aside invalidation, shared across ring workers under RwMutex; identical structure to the epoll twin.
- crud_list uses a single windowed query (count(*) OVER()) instead of a page SELECT + a separate count(*).
- parse_crud_body_fast + borrowed JSON field parsers (with the json.decode fallback kept for escaped bodies).
- parse_i64_slice / dechunk_into / parse_hex_slice: allocation-free query/body parsing (qint no longer materializes a string per param).
- static: sendfile_min_bytes = 16 KiB, matching epoll (bounds per-connection RSS at high conn counts).

Net: the non-DB hot paths (pipeline / baseline / upload / json / json-comp / static) are now zero-alloc under the default GC, which should also pull io_uring's steady-state memory down toward epoll's (baseline was ~1.5 GiB vs epoll's ~78 MiB, dominated by the per-request sum.str() / json.encode churn this PR removes).

What is intentionally NOT changed
DB profiles (fortunes, async-db, api-*, crud) stay capped by the blocking db.pg on the single ring worker: the io_uring backend has no async runtime to await DB readiness on the ring. Tracked in enghitalo/vanilla#83 (async runtime) and enghitalo/vanilla#93 (per-worker state). Those are the only remaining divergences from the epoll twin.

Validation
- Both images build (v -prod -d vanilla_tls).
- pipeline, baseline11 (positive and negative), upload, json, json-comp, async-db, fortunes, static (br negotiation), crud list, crud GET (MISS then HIT), crud create, crud update, 404, and json-tls: all 17 byte-for-byte identical to vanilla-epoll.
- X-Cache: GET MISS → HIT, then re-MISS after a PUT (slab invalidation).
- POST → 201.
- json-comp → Content-Encoding: gzip.
- json-tls → 200 over TLS 1.3.

🤖 Generated with Claude Code
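The MISS → HIT → re-MISS sequence checked above follows from the cache-aside slab design. As a rough illustration of that design (in Go, with hypothetical names Slab, lookup and Invalidate that are not the entry's identifiers), an id-indexed slice of reusable buffers guarded by a read-write lock behaves like this:

```go
package main

import (
	"fmt"
	"sync"
)

type slot struct {
	buf   []byte // reused in place across refills
	valid bool
}

// Slab is an id-indexed cache: the item id is the slice index, so a lookup
// needs no hashing. Illustrative only; sizes and names are assumptions.
type Slab struct {
	mu    sync.RWMutex
	slots []slot
}

func NewSlab(maxID int) *Slab { return &Slab{slots: make([]slot, maxID+1)} }

// Get appends the cached bytes to dst under the read lock.
func (s *Slab) Get(id int, dst []byte) ([]byte, bool) {
	if id < 0 || id >= len(s.slots) {
		return dst, false
	}
	s.mu.RLock()
	defer s.mu.RUnlock()
	if !s.slots[id].valid {
		return dst, false
	}
	return append(dst, s.slots[id].buf...), true
}

// Put stores body, reusing the slot's existing capacity when it fits.
func (s *Slab) Put(id int, body []byte) {
	if id < 0 || id >= len(s.slots) {
		return
	}
	s.mu.Lock()
	s.slots[id].buf = append(s.slots[id].buf[:0], body...)
	s.slots[id].valid = true
	s.mu.Unlock()
}

// Invalidate marks the slot stale (for example after a PUT) but keeps its buffer.
func (s *Slab) Invalidate(id int) {
	if id < 0 || id >= len(s.slots) {
		return
	}
	s.mu.Lock()
	s.slots[id].valid = false
	s.mu.Unlock()
}

// lookup is the cache-aside read: try the slab, otherwise load and fill it.
func lookup(s *Slab, id int, load func(int) []byte) ([]byte, string) {
	if b, ok := s.Get(id, nil); ok {
		return b, "HIT"
	}
	b := load(id)
	s.Put(id, b)
	return b, "MISS"
}

func main() {
	s := NewSlab(16)
	load := func(id int) []byte { return []byte(fmt.Sprintf(`{"id":%d}`, id)) }

	_, x1 := lookup(s, 3, load)
	_, x2 := lookup(s, 3, load)
	s.Invalidate(3) // stands in for a PUT on item 3
	_, x3 := lookup(s, 3, load)
	fmt.Println(x1, x2, x3) // MISS HIT MISS
}
```

Keeping the buffer on invalidation means a refill after an update usually reuses the same backing memory, which is the "in-place buffer reuse" property the change list mentions.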