PlateerLab · SonAIengine · Jul 2, 2026
diff --git a/.github/workflows/public-scale.yml b/.github/workflows/public-scale.yml
@@ -0,0 +1,80 @@
+name: Public Scale Guard
+
+# Scheduled, public large-corpus smoke for the SQLite/EvidenceSearch path.
+# Benchmark JSON files are gitignored, so this workflow regenerates public
+# datasets from HuggingFace and runs staged guards that keep selected query
+# gold docs in each indexed corpus.
+
+on:
+  schedule:
+    - cron: "30 6 * * 2" # Tuesdays 06:30 UTC
+  workflow_dispatch: {}
+
+jobs:
+  public-scale:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Set up Python 3.12
+        run: uv python install 3.12
+
+      - name: Cache uv
+        uses: actions/cache@v4
+        with:
+          path: ~/.cache/uv
+          key: uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}
+          restore-keys: uv-${{ runner.os }}-
+
+      - name: Install dependencies
+        run: uv sync --extra sqlite --extra eval
+
+      - name: Download public scale benchmarks
+        run: |
+          uv run --extra eval python examples/ablation/download_benchmarks.py \
+            --only fiqa,trec_covid
+
+      - name: Run FiQA 10k scale smoke
+        run: |
+          set -o pipefail
+          PYTHONUNBUFFERED=1 uv run --extra sqlite python examples/ablation/run_tier1_benchmarks.py \
+            --only fiqa \
+            --subset 5 \
+            --corpus-limit 10000 \
+            --use-sqlite-graph \
+            --max-build-sec 120 \
+            --max-search-sec 20 \
+            --min-hit-rate-at-10 0.40 \
+            --min-mrr 0.20 | tee /tmp/fiqa_scale_guard.log
+          cp "$(ls -t examples/ablation/diagnostics/tier1_*.md | head -1)" \
+            /tmp/fiqa_scale_guard.md
+
+      - name: Run TREC-COVID 50k scale smoke
+        run: |
+          set -o pipefail
+          PYTHONUNBUFFERED=1 uv run --extra sqlite python examples/ablation/run_tier1_benchmarks.py \
+            --only trec_covid \
+            --subset 10 \
+            --corpus-limit 50000 \
+            --use-sqlite-graph \
+            --max-build-sec 240 \
+            --max-search-sec 30 \
+            --min-hit-rate-at-10 0.80 \
+            --min-mrr 0.50 | tee /tmp/trec_covid_scale_guard.log
+          cp "$(ls -t examples/ablation/diagnostics/tier1_*.md | head -1)" \
+            /tmp/trec_covid_scale_guard.md
+
+      - name: Upload public scale results
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: public-scale-results
+          path: |
+            /tmp/fiqa_scale_guard.log
+            /tmp/fiqa_scale_guard.md
+            /tmp/trec_covid_scale_guard.log
+            /tmp/trec_covid_scale_guard.md
+          if-no-files-found: ignore
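
Reviewer note: the guard flags in the two smoke steps above (`--max-build-sec`, `--max-search-sec`, `--min-hit-rate-at-10`, `--min-mrr`) imply a pass/fail check roughly like the sketch below. This is an illustration under assumptions, not the code in `run_tier1_benchmarks.py`: `GuardLimits`, `check_guards`, and the metric key names are hypothetical. The sample numbers are the FiQA 10k row from the diagnostics report added later in this diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GuardLimits:
    """Hypothetical container mirroring the workflow's guard flags."""
    max_build_sec: Optional[float] = None
    max_search_sec: Optional[float] = None
    min_hit_rate_at_10: Optional[float] = None
    min_mrr: Optional[float] = None


def check_guards(metrics: dict, limits: GuardLimits) -> list:
    """Return human-readable violations; an empty list means the guard passes."""
    failures = []
    if limits.max_build_sec is not None and metrics["build_sec"] > limits.max_build_sec:
        failures.append(f"build {metrics['build_sec']:.1f}s > {limits.max_build_sec:.1f}s")
    if limits.max_search_sec is not None and metrics["search_sec"] > limits.max_search_sec:
        failures.append(f"search {metrics['search_sec']:.1f}s > {limits.max_search_sec:.1f}s")
    if limits.min_hit_rate_at_10 is not None and metrics["hit_rate_at_10"] < limits.min_hit_rate_at_10:
        failures.append(f"hit@10 {metrics['hit_rate_at_10']:.3f} < {limits.min_hit_rate_at_10:.3f}")
    if limits.min_mrr is not None and metrics["mrr"] < limits.min_mrr:
        failures.append(f"mrr {metrics['mrr']:.3f} < {limits.min_mrr:.3f}")
    return failures


# FiQA 10k row from the report, checked against the limits set in the workflow.
fiqa_10k = {"build_sec": 3.2, "search_sec": 0.1, "hit_rate_at_10": 3 / 5, "mrr": 0.425}
fiqa_limits = GuardLimits(
    max_build_sec=120, max_search_sec=20, min_hit_rate_at_10=0.40, min_mrr=0.20
)
violations = check_guards(fiqa_10k, fiqa_limits)
```

A real guard would presumably exit non-zero when `violations` is non-empty. Note that if such a step fails, the `cp` that follows it in the same step is skipped, so only the `.log` file would reach the artifact upload.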
diff --git a/examples/ablation/diagnostics/public_scale_20260702.md b/examples/ablation/diagnostics/public_scale_20260702.md
@@ -0,0 +1,70 @@
+# Public Large-Corpus Scale Smoke - 2026-07-02
+
+## Datasets
+
+| Dataset | Local artifact | Corpus | Queries | Smoke scope |
+|---------|----------------|-------:|--------:|-------------|
+| BEIR FiQA test | `tests/benchmark/data/fiqa.json` | 57,638 docs | 648 | 5-10 queries |
+| BEIR TREC-COVID test | `tests/benchmark/data/trec_covid.json` | 171,332 docs | 50 | 10 queries |
+
+Mode: embedder-free `graph.search()` with `SqliteGraphBackend`.
+
+## Commands
+
+```bash
+uv run --extra eval python examples/ablation/download_benchmarks.py --only fiqa
+uv run --extra eval python examples/ablation/download_benchmarks.py --only trec_covid
+
+PYTHONUNBUFFERED=1 uv run --extra sqlite python examples/ablation/run_tier1_benchmarks.py --only fiqa --subset 5 --corpus-limit 10000 --use-sqlite-graph
+PYTHONUNBUFFERED=1 uv run --extra sqlite python examples/ablation/run_tier1_benchmarks.py --only fiqa --subset 10 --corpus-limit 25000 --use-sqlite-graph
+PYTHONUNBUFFERED=1 uv run --extra sqlite python examples/ablation/run_tier1_benchmarks.py --only fiqa --subset 10 --use-sqlite-graph
+
+PYTHONUNBUFFERED=1 uv run --extra sqlite python examples/ablation/run_tier1_benchmarks.py --only trec_covid --subset 10 --corpus-limit 50000 --use-sqlite-graph
+PYTHONUNBUFFERED=1 uv run --extra sqlite python examples/ablation/run_tier1_benchmarks.py --only trec_covid --subset 10 --corpus-limit 100000 --use-sqlite-graph
+PYTHONUNBUFFERED=1 uv run --extra sqlite python examples/ablation/run_tier1_benchmarks.py --only trec_covid --subset 10 --use-sqlite-graph
+```
+
+## FiQA Results
+
+Before the SQLite batch FTS optimization:
+
+| Docs | Queries | MRR@10 | R@5 | R@10 | Hit@10 | Build | Search |
+|-----:|--------:|-------:|----:|-----:|-------:|------:|-------:|
+| 10,000 | 5 | 0.425 | 0.300 | 0.400 | 3/5 | 13.2s | 0.1s |
+| 25,000 | 10 | 0.353 | 0.333 | 0.383 | 5/10 | 101.7s | 0.6s |
+| 57,638 | 10 | 0.202 | 0.233 | 0.333 | 5/10 | 577.4s | 1.5s |
+
+After the SQLite batch FTS optimization:
+
+| Docs | Queries | MRR@10 | R@5 | R@10 | Hit@10 | Build | Search |
+|-----:|--------:|-------:|----:|-----:|-------:|------:|-------:|
+| 10,000 | 5 | 0.425 | 0.300 | 0.400 | 3/5 | 3.2s | 0.1s |
+| 25,000 | 10 | 0.353 | 0.333 | 0.383 | 5/10 | 9.3s | 0.6s |
+| 57,638 | 10 | 0.202 | 0.233 | 0.333 | 5/10 | 58.4s | 1.4s |
+
+## TREC-COVID Results
+
+After the SQLite batch FTS optimization:
+
+| Docs | Queries | MRR@10 | R@5 | R@10 | Hit@10 | Build | Search |
+|-----:|--------:|-------:|----:|-----:|-------:|------:|-------:|
+| 50,000 | 10 | 0.933 | 0.008 | 0.015 | 10/10 | 20.6s | 1.4s |
+| 100,000 | 10 | 0.750 | 0.007 | 0.012 | 10/10 | 55.2s | 2.8s |
+| 171,332 | 10 | 0.598 | 0.004 | 0.011 | 10/10 | 135.1s | 5.2s |
+
+TREC-COVID has many relevant documents per query, so R@5 and R@10 are
+naturally small in this smoke even when Hit@10 is perfect.
+
+## Interpretation
+
+- Search latency remains usable at 171k docs: 5.2s over 10 queries (about 0.52s per query).
+- The main large-corpus bottleneck is still initial FTS/index build, not retrieval.
+- Avoiding unnecessary FTS deletes for newly inserted nodes reduced full FiQA build time by about 9.9x.
+- Raising benchmark ingest batches to 20k reduced full TREC-COVID build time by about 2.7x.
+- `--corpus-limit` provides practical staged scale gates while preserving selected query gold docs.
+
+## Guard Policy
+
+- `.github/workflows/public-scale.yml` runs the FiQA 10k and TREC-COVID 50k staged smokes weekly and on manual dispatch.
+- FiQA 25k/full and TREC-COVID 100k/full remain manual checks because they are multi-minute runs and depend on ignored local benchmark data.
+- If 100k+ docs becomes a required routine gate, the next target is faster initial FTS/index build.
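
Reviewer note: the report's remark that R@5/R@10 stay small while Hit@10 is perfect follows from how the metrics are defined. The sketch below assumes the standard definitions of recall@k, hit@k, and reciprocal rank; the benchmark script may compute them differently. The function names and the count of 500 relevant documents are illustrative, not taken from the TREC-COVID qrels. The last two lines reproduce the arithmetic behind two interpretation bullets using figures from the tables.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant set that appears in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked, relevant, k):
    """True when at least one relevant doc appears in the top-k results."""
    return any(doc in relevant for doc in ranked[:k])


def reciprocal_rank_at_k(ranked, relevant, k):
    """1/rank of the first relevant doc within the top-k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Toy query with 500 relevant docs (illustrative count only).
relevant = {f"rel-{i}" for i in range(500)}
# A perfect top-10: every returned doc is relevant.
ranked = [f"rel-{i}" for i in range(10)]

hit = hit_at_k(ranked, relevant, 10)
rr = reciprocal_rank_at_k(ranked, relevant, 10)
# Recall@10 cannot exceed 10/500 here, however good the ranking is.
r10 = recall_at_k(ranked, relevant, 10)

# Arithmetic behind the interpretation bullets, from the tables above.
fiqa_build_speedup = 577.4 / 58.4  # full FiQA build, before vs after
trec_sec_per_query = 5.2 / 10      # full TREC-COVID search time per query
```

With k results and R relevant documents, recall@k is bounded above by min(k, R)/R, which is why a query with hundreds of relevant documents can score near zero on R@10 while still counting as a hit.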