
Prepare larger MS MARCO scale tier #108

Merged
SonAIengine merged 1 commit into main from large-msmarco-tagged-anchors
Jul 2, 2026

SonAIengine commented Jul 2, 2026

Contributor

Summary

  • add tag-aware node lookup for Memory/SQLite/Composite backends and use it for QueryAnchor category loading
  • add side-by-side MS MARCO large-shard options (--large-output-suffix, --msmarco-path) so 1M and 5M manifests can coexist
  • document the 1M reuse improvement and local 5M MS MARCO shard
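A tag-aware lookup of the kind described in the first bullet could look roughly like the sketch below. It is not the code from this PR: the `nodes` and `node_tags` tables, their columns, and the exact `list_nodes_by_tag` signature are assumptions made for illustration, using only the standard-library `sqlite3` module. The idea is that a primary key led by `tag` lets SQLite seek directly to matching rows instead of scanning every node.

```python
import sqlite3

# Hypothetical schema: the real Synaptic tables and column names may differ.
SCHEMA = """
CREATE TABLE nodes (
    id    TEXT PRIMARY KEY,
    kind  TEXT NOT NULL,
    title TEXT NOT NULL
);
CREATE TABLE node_tags (
    node_id TEXT NOT NULL REFERENCES nodes(id),
    tag     TEXT NOT NULL,
    PRIMARY KEY (tag, node_id)   -- tag first, so lookups by tag are index seeks
) WITHOUT ROWID;
"""


def list_nodes_by_tag(conn, tag, kind=None, limit=100):
    """Return (id, kind, title) rows carrying `tag`, optionally filtered by kind."""
    sql = (
        "SELECT n.id, n.kind, n.title "
        "FROM node_tags AS t JOIN nodes AS n ON n.id = t.node_id "
        "WHERE t.tag = ?"
    )
    params = [tag]
    if kind is not None:
        sql += " AND n.kind = ?"
        params.append(kind)
    sql += " ORDER BY n.id LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


def build_demo_db():
    """Create a tiny in-memory database with made-up rows for the example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO nodes (id, kind, title) VALUES (?, ?, ?)",
        [
            ("n1", "category", "Sports"),
            ("n2", "category", "Finance"),
            ("n3", "passage", "A passage about sports"),
            ("n4", "category", "Untagged category"),
        ],
    )
    conn.executemany(
        "INSERT INTO node_tags (node_id, tag) VALUES (?, ?)",
        [("n1", "anchor"), ("n2", "anchor"), ("n3", "anchor")],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for row in list_nodes_by_tag(demo, "anchor", kind="category", limit=10):
        print(row)
```

With this layout, loading the tagged category nodes touches only the rows for that tag, so the cost is driven by the number of tagged nodes, not by the size of the corpus.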

Results

  • 1M MS MARCO persistent reuse: 50-query search time dropped from 9.1s to 7.5s (about 18% less), with MRR@10 unchanged at 0.479 and Hit@10 unchanged at 31/50
  • First QueryAnchor category load on the 1M DB: about 0.218s, down from roughly 1.7-2.0s previously
  • Created local gitignored 5M shard: tests/benchmark/data/msmarco_passage_5m.json + .corpus.jsonl (5,000,000 rows)
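For readers unfamiliar with the two quality metrics quoted above, the sketch below shows the standard definitions of MRR@10 and Hit@10. It is not taken from `run_tier1_benchmarks.py`; the function names and the toy rankings are illustrative assumptions only.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant result within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=10):
    """Aggregate MRR@k and Hit@k over (ranked_ids, relevant_ids) pairs."""
    scores = [reciprocal_rank_at_k(ranked, relevant, k) for ranked, relevant in runs]
    hits = sum(1 for score in scores if score > 0.0)
    return {
        "mrr": sum(scores) / len(scores),
        "hits": hits,
        "queries": len(scores),
    }


if __name__ == "__main__":
    # Toy data, not benchmark output: three queries with made-up rankings.
    toy_runs = [
        (["a", "b", "c"], {"a"}),   # relevant at rank 1
        (["x", "y", "z"], {"z"}),   # relevant at rank 3
        (["p", "q"], {"missing"}),  # no relevant result in the top k
    ]
    result = evaluate(toy_runs)
    print(f"MRR@10 = {result['mrr']:.3f}, Hit@10 = {result['hits']}/{result['queries']}")
```

Under these definitions MRR@10 can never exceed the Hit@10 rate, which is consistent with the reported 0.479 against 31/50.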

Tests

  • uv run ruff check src/synaptic/extensions/query_anchor.py src/synaptic/backends/memory.py src/synaptic/backends/sqlite.py src/synaptic/backends/composite.py examples/ablation/download_benchmarks.py examples/ablation/run_tier1_benchmarks.py tests/test_query_anchor.py tests/test_backend_memory.py tests/test_backend_sqlite.py tests/test_download_benchmarks.py tests/test_tier1_benchmarks.py
  • uv run pytest tests/test_download_benchmarks.py tests/test_tier1_benchmarks.py tests/test_query_anchor.py tests/test_backend_memory.py::TestMemoryBackendNodes::test_list_nodes_by_tag_filters_kind_and_limit tests/test_backend_sqlite.py::TestListNodesByTag -q
  • uv run python examples/ablation/run_tier1_benchmarks.py --only msmarco --subset 50 --use-sqlite-graph --sqlite-db-path tests/benchmark/data/msmarco_1m.db --reuse-sqlite-db --corpus-limit 1000000 --max-search-sec 30 --min-hit-rate-at-10 0.0 --min-mrr 0.0

SonAIengine force-pushed the large-msmarco-tagged-anchors branch from 5ce1e78 to 34caed3 on July 2, 2026 04:22
SonAIengine merged commit 5a5afb1 into main on Jul 2, 2026
2 checks passed
SonAIengine deleted the large-msmarco-tagged-anchors branch on July 2, 2026 04:24