Complete benchmark comparing AI-generated chat apps on SpacetimeDB vs PostgreSQL (Express + Socket.io + Drizzle ORM). Same model (Claude Sonnet 4.6), same prompts, 12 feature levels, two independent runs. Results: https://spacetimedb.com/llms-benchmark-sequential-upgrade Tooling: `run.sh` (generation/upgrade/fix orchestrator), `grade.sh` (grading), OTel cost tracking, `perf-benchmark` stress-throughput tool. Two runs (20260403 original methodology; 20260406 refined with domain bias removed). Both include full app source, level snapshots, and per-session telemetry cost summaries.
Description of Changes
AI app generation benchmark comparing SpacetimeDB vs PostgreSQL (Express + Socket.io + Drizzle ORM). Same AI model (Claude Sonnet 4.6), same prompts, same chat app, two backends. Upgraded through 12 feature levels, manually graded at each level, bugs fixed, all costs measured via OpenTelemetry.
Results viewable at: https://spacetimedb.com/llms-benchmark-sequential-upgrade
Benchmark harness (`tools/llm-sequential-upgrade/`)

- `run.sh`: orchestrates headless Claude Code sessions for code generation, sequential upgrades, and bug fixes. Tracks all API costs via OTel. Supports `--upgrade`, `--fix`, `--composed-prompt`, and `--resume-session` modes.
- `grade.sh` / `grade-agents.sh`: grading harnesses for manual testing of generated apps.
- `docker-compose.otel.yaml`: OTel collector + PostgreSQL services.
- `generate-report.mjs` / `parse-telemetry.mjs`: aggregate per-session telemetry into cost reports.
- `backends/`: SpacetimeDB SDK reference, config templates, server setup docs, and PostgreSQL setup with Drizzle/Socket.io guidance.

After LLM Benchmark Improvements + More Evals #4740 merges, we will likely want to update this so that it reads backend and SDK guidance from SKILLS.
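A thin driver can compose these modes across the 12 levels. A minimal sketch: `--upgrade` and `--fix` are real `run.sh` modes from this harness, but the `--backend` and `--level` argument shapes below are illustrative assumptions, not the actual `run.sh` interface.

```javascript
// Hypothetical driver that plans run.sh invocations for one backend.
// Only --upgrade / --fix come from the harness; --backend and --level
// are assumed argument shapes for the sake of illustration.
function planRun(backend, levels) {
  // Initial L1 generation session (flag shape assumed).
  const commands = [["./run.sh", `--backend=${backend}`]];
  // One sequential upgrade session per remaining feature level.
  for (let level = 2; level <= levels; level++) {
    commands.push(["./run.sh", "--upgrade", `--backend=${backend}`, `--level=${level}`]);
  }
  return commands;
}
```

Planning `planRun("spacetimedb", 12)` yields 12 command lines: one generation session followed by 11 upgrades, with `--fix` sessions interleaved manually whenever grading finds bugs.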
Two complete benchmark runs
Run 1 (20260403): Original methodology.
Run 2 (20260406): Refined methodology with domain bias removed from SpacetimeDB SDK docs and PostgreSQL instructions made feature-spec-neutral.
Note: these changes produced no meaningful difference in results. The domain-familiarity biases were very small and almost certainly not the cause of SpacetimeDB's major gains over the PostgreSQL stack.
Each run contains full L1-L12 app source for both backends, level snapshots preserving state before each upgrade, and per-session OTel cost summaries.
12 feature levels
Results
Both runs agree: SpacetimeDB apps are cheaper to build, have fewer bugs, and require fewer fix iterations. The refined methodology (Run 2) widened the cost gap and confirmed the advantage is structural, not an artifact of domain-biased SDK docs.
Performance benchmark (`perf-benchmark/`)

A stress-throughput tool that fires concurrent writers at peak saturation against the AI-generated `send_message` handlers. The gap widens with optimization because SpacetimeDB's bottleneck is fixable code patterns in the reducer, while PostgreSQL's bottleneck is architectural (sequential network round-trips to an external database).
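The core measurement loop can be sketched as follows. This is a simplified stand-in, not the actual `perf-benchmark` tool: the handler here is a local stub, whereas the real tool drives the generated backends over the network.

```javascript
// Minimal sketch of a stress-throughput loop: fire N concurrent writers
// at a send_message handler and compute messages/sec. The handler is
// injected so the sketch is self-contained; the real benchmark targets
// the AI-generated backends remotely.
async function stressTest(sendMessage, writers, messagesPerWriter) {
  const start = process.hrtime.bigint();
  await Promise.all(
    Array.from({ length: writers }, async (_, w) => {
      for (let i = 0; i < messagesPerWriter; i++) {
        await sendMessage(`writer-${w}`, `msg-${i}`);
      }
    })
  );
  const elapsedSec = Number(process.hrtime.bigint() - start) / 1e9;
  const total = writers * messagesPerWriter;
  return { total, throughput: total / elapsedSec };
}
```

Saturation is found by raising the writer count until throughput stops climbing; comparing that peak across backends is what exposes the reducer-vs-round-trip bottleneck difference described above.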
Optimized reference code with all features preserved is in `perf-benchmark/results/optimized-reference/`.

Data handling
Per-session cost summaries (`cost-summary.json`, `COST_REPORT.md`, `metadata.json`) are committed. Raw OTel telemetry (`raw-telemetry.jsonl`) containing PII is excluded via `.gitignore` and stored privately.

API and ABI breaking changes
None. All changes are in `tools/llm-sequential-upgrade/`. No production code, library, or SDK changes.

Expected complexity level and risk
1 - Trivial. Self-contained benchmarking tooling and data. No interaction with production code.
Testing