diff --git a/README.md b/README.md index 00062e3..0101ca6 100644 --- a/README.md +++ b/README.md @@ -149,6 +149,8 @@ We benchmark quality and speed across all methods on ~1,250 queries over 63 repo | CodeRankEmbed | 0.765 | 57 s | 16 ms | | ColGREP | 0.693 | 5.8 s | 124 ms | | BM25 | 0.673 | 263 ms | 0.02 ms | +| grepai | 0.561 | 35 s | 48 ms | +| probe | 0.387 | — | 207 ms | | ripgrep | 0.126 | — | 12 ms | Semble achieves 99% of the performance of the 137M-parameter [CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) Hybrid, while indexing 218x faster and answering queries 11x faster. See [benchmarks](benchmarks/README.md) for per-language results, ablations, and methodology. diff --git a/assets/images/speed_vs_ndcg_cold.png b/assets/images/speed_vs_ndcg_cold.png index 739fba3..2056b96 100644 Binary files a/assets/images/speed_vs_ndcg_cold.png and b/assets/images/speed_vs_ndcg_cold.png differ diff --git a/assets/images/speed_vs_ndcg_warm.png b/assets/images/speed_vs_ndcg_warm.png index 5444582..4d97d23 100644 Binary files a/assets/images/speed_vs_ndcg_warm.png and b/assets/images/speed_vs_ndcg_warm.png differ diff --git a/benchmarks/README.md b/benchmarks/README.md index 3ba8566..b0011e3 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -7,6 +7,7 @@ Quality and speed benchmarks for `semble`. - [Ablations](#ablations) - [Dataset](#dataset) - [Methods](#methods) +- [Excluded methods](#excluded-methods) - [Running the benchmarks](#running-the-benchmarks) ## Main results @@ -20,6 +21,8 @@ Quality and speed across all methods. | CodeRankEmbed | 0.765 | 57 s | 16 ms | | ColGREP | 0.693 | 5.8 s | 124 ms | | BM25 | 0.673 | 263 ms | 0.02 ms | +| grepai | 0.561 | 35 s | 48 ms | +| probe | 0.387 | — | 207 ms | | ripgrep | 0.126 | — | 12 ms | | ![Speed vs quality (cold)](../assets/images/speed_vs_ndcg_cold.png) | ![Speed vs quality (warm)](../assets/images/speed_vs_ndcg_warm.png) | @@ -34,28 +37,28 @@ NDCG@10 is averaged across all queries. Speed numbers use one repo per language, NDCG@10 per language, sorted by CodeRankEmbed Hybrid (CRE in the table). Best score per row is bolded. 
-| Language | semble | CRE Hybrid | CRE | ColGREP | ripgrep | -|---|---:|---:|---:|---:|---:| -| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.180 | -| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.126 | -| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.230 | -| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.134 | -| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.176 | -| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.000 | -| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.117 | -| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.133 | -| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.202 | -| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.123 | -| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.160 | -| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.000 | -| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.000 | -| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.198 | -| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.166 | -| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.162 | -| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.000 | -| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.000 | -| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.128 | -| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.126** | +| Language | semble | CRE Hybrid | CRE | ColGREP | grepai | probe | ripgrep | +|---|---:|---:|---:|---:|---:|---:|---:| +| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.330 | 0.392 | 0.180 | +| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.731 | 0.375 | 0.126 | +| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.643 | 0.382 | 0.230 | +| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.669 | 0.412 | 0.134 | +| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.675 | 0.588 | 0.176 | +| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.755 | 0.369 | 0.000 | +| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.277 | 0.392 | 0.117 | +| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.722 | 0.410 | 0.133 | +| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.634 | 0.488 | 0.202 | +| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.402 | 0.340 | 0.123 | +| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.429 | 0.280 | 0.160 | +| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.723 | 0.226 | 0.000 | +| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.699 | 0.336 | 0.000 | +| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.386 | 0.536 | 0.198 | +| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.478 | 0.335 | 0.166 | +| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.519 | 0.242 | 0.162 | +| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.555 | 0.384 | 0.000 | +| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.483 | 0.313 | 0.000 | +| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.394 | 0.354 | 0.128 | +| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.561** | **0.387** | **0.126** | ## Ablations @@ -102,10 +105,19 @@ NDCG@10 per language, sorted by CodeRankEmbed Hybrid (CRE in the table). Best sc ## Methods - **[ripgrep](https://github.com/BurntSushi/ripgrep)**: fast regex search over files, included as a raw keyword-match baseline. +- **[probe](https://github.com/buger/probe)**: BM25 keyword ranking backed by tree-sitter parse trees. No persistent index; scans on the fly. - **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)**: late-interaction code retrieval built on next-plaid with the [LateOn-Code-edge](https://huggingface.co/lightonai/LateOn-Code-edge) model. 
+- **[grepai](https://github.com/nicholasgasior/grepai)**: semantic search using [nomic-embed-text](https://huggingface.co/nomic-ai/nomic-embed-text-v1) (137M params) via a local Ollama daemon.
 - **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)**: 137M-param transformer embedding model for code retrieval. *CodeRankEmbed Hybrid* fuses its dense scores with BM25.
 - **[semble](https://github.com/your-repo/semble)**: this library. [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.

+## Excluded methods
+
+Two tools were considered but not included in the benchmark:
+
+- **[codanna](https://codanna.io)**: symbol-level semantic search with fastembed. Excluded because it does not support Haskell, Bash, Zig, Scala, Elixir, or Ruby: 6 of the 19 benchmark languages, covering 20 of the 63 repos and ~38% of tasks.
+- **[claude-context](https://github.com/zilliztech/claude-context)**: retrieval-augmented code search using OpenAI embeddings and a vector database. Excluded because it requires a paid OpenAI API key and a running vector-DB service.
+
 ## Running the benchmarks

 Repos are pinned in `repos.json` and cloned into `~/.cache/semble-bench`:

@@ -152,6 +164,42 @@ uv run python -m benchmarks.baselines.ablations --mode semble-semantic
+
+probe + +Needs `probe` on `$PATH` (`npm install -g @buger/probe`). + +```bash +uv run python -m benchmarks.baselines.probe +uv run python -m benchmarks.baselines.probe --repo fastapi --repo axios +``` + +
+ +
+
+grepai
+
+Needs `grepai` on `$PATH` and Ollama running with `nomic-embed-text` pulled:
+
+```bash
+ollama pull nomic-embed-text
+```
+
+```bash
+uv run python -m benchmarks.baselines.grepai
+uv run python -m benchmarks.baselines.grepai --repo fastapi --repo axios
+```
+
+Large repos take several minutes to index. Use `--timeout SECONDS` (default 120) for repos with many files:
+
+```bash
+uv run python -m benchmarks.baselines.grepai --timeout 1800 --output results.json
+```
+
+The `--output` flag enables resume mode: repos already recorded in the output file are skipped on restart.
+
+
+
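+Both the probe and grepai harnesses score queries the same way: run the tool, take its top-10 file paths, look up the ranks of the task's relevant files with `file_rank`, and compute NDCG@10 with `ndcg_at_k` from `benchmarks/metrics.py`, averaged over each repo's queries. A minimal sketch of that metric, assuming binary relevance and 1-based ranks (the benchmark's own `ndcg_at_k` is the source of truth):
+
+```python
+import math
+
+
+def ndcg_at_k(relevant_ranks: list[int], n_relevant: int, k: int) -> float:
+    """Binary-relevance NDCG@k, given 1-based ranks of the relevant files."""
+    dcg = sum(1.0 / math.log2(rank + 1) for rank in relevant_ranks if rank <= k)
+    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(n_relevant, k) + 1))
+    return dcg / ideal if ideal else 0.0
+
+
+# Two relevant files found at ranks 1 and 4 of the top-10:
+# ndcg_at_k([1, 4], 2, 10) ≈ (1.0 + 0.431) / (1.0 + 0.631) ≈ 0.877
+```
+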
ripgrep diff --git a/benchmarks/baselines/grepai.py b/benchmarks/baselines/grepai.py new file mode 100644 index 0000000..0eef5be --- /dev/null +++ b/benchmarks/baselines/grepai.py @@ -0,0 +1,331 @@ +import argparse +import json +import os +import shutil +import signal +import subprocess +import sys +import tempfile +import time +from dataclasses import dataclass +from pathlib import Path + +from benchmarks.data import ( + RepoSpec, + Task, + apply_task_filters, + available_repo_specs, + grouped_tasks, + load_tasks, + save_results, +) +from benchmarks.metrics import file_rank, ndcg_at_k + +_GREPAI = "grepai" +_TOP_K = 10 +_LATENCY_RUNS = 1 # Ollama embedding calls are slow; single run is sufficient +_INDEX_TIMEOUT = 300 +_SEARCH_TIMEOUT = 60 +_WATCH_READY_TIMEOUT = 120 # overridden by --timeout + + +@dataclass(frozen=True) +class RepoResult: + """Per-repo benchmark result.""" + + repo: str + language: str + ndcg10: float + p50_ms: float + index_ms: float + + +def _cleanup_index(benchmark_dir: Path) -> None: + d = benchmark_dir / ".grepai" + if d.exists(): + shutil.rmtree(d, ignore_errors=True) + + +def _build_index(benchmark_dir: Path, *, watch_ready_timeout: int = _WATCH_READY_TIMEOUT) -> tuple[bool, float]: + """Init and index a repo with grepai; return (success, elapsed_ms).""" + _cleanup_index(benchmark_dir) + + init_proc = subprocess.run( + [_GREPAI, "init", "--provider", "ollama", "--yes"], + capture_output=True, + text=True, + cwd=benchmark_dir, + timeout=30, + ) + if init_proc.returncode != 0: + print(f" WARNING: grepai init failed: {init_proc.stderr.strip()}", file=sys.stderr) + return False, 0.0 + + # grepai writes progress bars with \r (no \n), so readline() blocks forever. + # Write stdout to a temp file and poll for the sentinel string instead. + # "Initial scan complete" appears after file scanning but BEFORE embeddings + # finish. Wait for 3 s of output silence after that sentinel to ensure all + # embeddings have been flushed to disk before killing watch. 
+ started = time.perf_counter() + watch_proc: subprocess.Popen[bytes] | None = None + with tempfile.TemporaryFile() as log_f: + watch_proc = subprocess.Popen( + [_GREPAI, "watch"], + stdout=log_f, + stderr=subprocess.STDOUT, + cwd=benchmark_dir, + start_new_session=True, # Own process group so killpg doesn't hit us + ) + try: + deadline = time.perf_counter() + watch_ready_timeout + scan_complete = False + last_size = 0 + idle_since: float | None = None + _IDLE_SETTLE = 3.0 # seconds of silence after scan_complete → embeddings done + + while time.perf_counter() < deadline: + time.sleep(0.3) + log_f.seek(0) + content = log_f.read() + if not scan_complete and b"Initial scan complete" in content: + scan_complete = True + idle_since = time.perf_counter() + if scan_complete: + if len(content) != last_size: + idle_since = time.perf_counter() + last_size = len(content) + elif idle_since is not None and (time.perf_counter() - idle_since) >= _IDLE_SETTLE: + return True, (time.perf_counter() - started) * 1000 + if watch_proc.poll() is not None: + if scan_complete: + return True, (time.perf_counter() - started) * 1000 + break + print( + f" WARNING: grepai watch timed out after {watch_ready_timeout}s", + file=sys.stderr, + ) + return False, (time.perf_counter() - started) * 1000 + finally: + try: + os.killpg(os.getpgid(watch_proc.pid), signal.SIGTERM) + except (ProcessLookupError, PermissionError): + pass + watch_proc.wait(timeout=5) + + +def _run_search(query: str, benchmark_dir: Path, *, top_k: int) -> list[str]: + """Return absolute file paths from grepai JSON search output.""" + cmd = [_GREPAI, "search", query, "--json", "-n", str(top_k)] + try: + proc = subprocess.run( + cmd, + capture_output=True, + text=True, + timeout=_SEARCH_TIMEOUT, + cwd=benchmark_dir, + ) + except subprocess.TimeoutExpired: + return [] + if proc.returncode != 0: + return [] + try: + items = json.loads(proc.stdout) + except json.JSONDecodeError: + return [] + # grepai returns relative paths; make them absolute. 
+ seen: dict[str, None] = {} + for item in items: + rel = item.get("file_path", "") + if rel: + abs_path = str((benchmark_dir / rel).resolve()) + seen[abs_path] = None + return list(seen)[:top_k] + + +def _evaluate_repo( + tasks: list[Task], + benchmark_dir: Path, + *, + verbose: bool = False, +) -> tuple[float, float]: + """Return (mean ndcg@10, p50 latency ms) for a list of tasks.""" + ndcg10_sum = 0.0 + latencies: list[float] = [] + + for task in tasks: + query_latencies: list[float] = [] + file_paths: list[str] = [] + for _ in range(_LATENCY_RUNS): + started = time.perf_counter() + file_paths = _run_search(task.query, benchmark_dir, top_k=_TOP_K) + query_latencies.append((time.perf_counter() - started) * 1000) + latencies.append(sorted(query_latencies)[_LATENCY_RUNS // 2]) + + relevant_ranks = [rank for t in task.all_relevant if (rank := file_rank(file_paths, t.path)) is not None] + q_ndcg10 = ndcg_at_k(relevant_ranks, len(task.all_relevant), _TOP_K) + ndcg10_sum += q_ndcg10 + + if verbose: + print( + f" ndcg@10={q_ndcg10:.3f} ranks={relevant_ranks} n_rel={len(task.all_relevant)} q={task.query!r}", + file=sys.stderr, + ) + print(f" targets: {', '.join(t.path for t in task.all_relevant)}", file=sys.stderr) + print(f" top-5: {[Path(fp).name for fp in file_paths[:5]]}", file=sys.stderr) + + latencies.sort() + return ndcg10_sum / len(tasks), latencies[len(latencies) // 2] + + +def _run_repo( + spec: RepoSpec, + tasks: list[Task], + *, + verbose: bool, + watch_ready_timeout: int = _WATCH_READY_TIMEOUT, +) -> RepoResult | None: + """Index, evaluate, and clean up a single repo.""" + benchmark_dir = spec.benchmark_dir + ok, index_ms = _build_index(benchmark_dir, watch_ready_timeout=watch_ready_timeout) + if not ok: + print(f" SKIP: {spec.name} — grepai indexing failed", file=sys.stderr) + return None + + try: + ndcg10, p50_ms = _evaluate_repo(tasks, benchmark_dir, verbose=verbose) + finally: + _cleanup_index(benchmark_dir) + + return RepoResult(repo=spec.name, language=spec.language, ndcg10=ndcg10, p50_ms=p50_ms, index_ms=index_ms) + + +def _build_summary(results: list[RepoResult]) -> dict: + avg_ndcg10 = sum(r.ndcg10 for r in results) / len(results) + avg_p50 = sum(r.p50_ms for r in results) / len(results) + avg_index = sum(r.index_ms for r in results) / len(results) + return { + "tool": "grepai", + "note": "nomic-embed-text via Ollama (137 M params, ~8× larger than semble's potion-code-16M)", + "repos": [ + { + "repo": r.repo, + "language": r.language, + "ndcg10": round(r.ndcg10, 4), + "p50_ms": round(r.p50_ms, 1), + "index_ms": round(r.index_ms, 0), + } + for r in results + ], + "avg_ndcg10": round(avg_ndcg10, 4), + "avg_p50_ms": round(avg_p50, 1), + "avg_index_ms": round(avg_index, 0), + } + + +def _write_results(results: list[RepoResult], path: Path) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(json.dumps(_build_summary(results), indent=2)) + + +def _parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser(description="Benchmark grepai on the semble benchmark suite.") + parser.add_argument("--repo", action="append", default=[], help="Limit to one or more repo names.") + parser.add_argument("--language", action="append", default=[], help="Limit to one or more languages.") + parser.add_argument("--verbose", action="store_true", help="Print per-query results.") + parser.add_argument( + "--output", + metavar="FILE", + help="JSON file to write results to; if it already exists, repos already present are skipped (resume mode).", + ) + 
parser.add_argument( + "--timeout", + type=int, + default=_WATCH_READY_TIMEOUT, + metavar="SECONDS", + help=f"Seconds to wait for embeddings to finish (default: {_WATCH_READY_TIMEOUT}). " + "Increase for large repos (e.g. --timeout 1800).", + ) + return parser.parse_args() + + +def _load_existing(output_path: Path | None) -> dict[str, dict]: + """Load already-completed repos from a prior run's output file.""" + if output_path is None or not output_path.exists(): + return {} + try: + existing_data = json.loads(output_path.read_text()) + existing = {r["repo"]: r for r in existing_data.get("repos", [])} + print(f"Resuming: {len(existing)} repos already done, will skip them.", file=sys.stderr) + return existing + except (json.JSONDecodeError, KeyError): + return {} + + +def main() -> None: + """Run the grepai baseline benchmark.""" + args = _parse_args() + repo_specs = available_repo_specs() + tasks = apply_task_filters( + load_tasks(repo_specs=repo_specs), repos=args.repo or None, languages=args.language or None + ) + + output_path = Path(args.output) if args.output else None + existing = _load_existing(output_path) + + print("grepai (ollama/nomic-embed-text, 137M params)", file=sys.stderr) + print(f"{'Repo':<22} {'Language':<12} {'Index':>9} {'NDCG@10':>8} {'p50':>8}", file=sys.stderr) + print(f"{'-' * 22} {'-' * 12} {'-' * 9} {'-' * 8} {'-' * 8}", file=sys.stderr) + + results: list[RepoResult] = [] + for repo, repo_task_list in sorted(grouped_tasks(tasks).items()): + spec = repo_specs[repo] + if repo in existing: + r = existing[repo] + results.append( + RepoResult( + repo=r["repo"], + language=r["language"], + ndcg10=r["ndcg10"], + p50_ms=r["p50_ms"], + index_ms=r["index_ms"], + ) + ) + print(f"{repo:<22} {'(skipped — already done)':<12}", file=sys.stderr) + continue + if args.verbose: + print(f"\n--- {repo} ---", file=sys.stderr) + result = _run_repo(spec, repo_task_list, verbose=args.verbose, watch_ready_timeout=args.timeout) + if result is None: + continue + results.append(result) + print( + f"{repo:<22} {spec.language:<12} {result.index_ms:>8.0f}ms {result.ndcg10:>8.3f} {result.p50_ms:>7.1f}ms", + file=sys.stderr, + ) + + if output_path: + _write_results(results, output_path) + + if not results: + return + + avg_ndcg10 = sum(r.ndcg10 for r in results) / len(results) + avg_p50 = sum(r.p50_ms for r in results) / len(results) + avg_index = sum(r.index_ms for r in results) / len(results) + print(f"{'-' * 22} {'-' * 12} {'-' * 9} {'-' * 8} {'-' * 8}", file=sys.stderr) + avg_label = f"Average ({len(results)})" + print( + f"{avg_label:<22} {'':<12} {avg_index:>8.0f}ms {avg_ndcg10:>8.3f} {avg_p50:>7.1f}ms", + file=sys.stderr, + ) + + summary = _build_summary(results) + if output_path: + _write_results(results, output_path) + else: + save_results("grepai", summary) + print(json.dumps(summary, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/benchmarks/baselines/probe.py b/benchmarks/baselines/probe.py new file mode 100644 index 0000000..6f7699f --- /dev/null +++ b/benchmarks/baselines/probe.py @@ -0,0 +1,160 @@ +import argparse +import json +import subprocess +import sys +import time +from dataclasses import dataclass +from pathlib import Path + +from benchmarks.data import ( + Task, + apply_task_filters, + available_repo_specs, + grouped_tasks, + load_tasks, + save_results, +) +from benchmarks.metrics import file_rank, ndcg_at_k + +_TOP_K = 10 +_LATENCY_RUNS = 3 + + +@dataclass(frozen=True) +class RepoResult: + """Per-repo benchmark result.""" + + repo: str + language: str + 
ndcg10: float + p50_ms: float + + +def _run_probe(query: str, benchmark_dir: Path, *, top_k: int, timeout: int = 30) -> list[str]: + """Return file paths from probe JSON output, deduplicated and capped at top_k.""" + cmd = [ + "probe", + "search", + query, + str(benchmark_dir), + "--format", + "json", + "--max-results", + str(top_k * 3), # probe returns chunk-level results; over-fetch and dedup + ] + try: + proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout) + except subprocess.TimeoutExpired: + return [] + if proc.returncode != 0: + return [] + # probe prefixes stdout with non-JSON header lines ("Pattern: ...\nPath: ...\n") + # before the JSON object; skip to the first '{'. + json_start = proc.stdout.find("{") + if json_start < 0: + return [] + try: + data = json.loads(proc.stdout[json_start:]) + except json.JSONDecodeError: + return [] + seen: dict[str, None] = {} + for item in data.get("results", []): + fp = item.get("file", "") + if fp: + seen[fp] = None + return list(seen)[:top_k] + + +def _evaluate_repo( + tasks: list[Task], + benchmark_dir: Path, + *, + verbose: bool = False, +) -> tuple[float, float]: + """Return (mean ndcg@10, p50 latency ms) for a list of tasks.""" + ndcg10_sum = 0.0 + latencies: list[float] = [] + + for task in tasks: + query_latencies: list[float] = [] + file_paths: list[str] = [] + for _ in range(_LATENCY_RUNS): + started = time.perf_counter() + file_paths = _run_probe(task.query, benchmark_dir, top_k=_TOP_K) + query_latencies.append((time.perf_counter() - started) * 1000) + latencies.append(sorted(query_latencies)[_LATENCY_RUNS // 2]) + + relevant_ranks = [rank for t in task.all_relevant if (rank := file_rank(file_paths, t.path)) is not None] + q_ndcg10 = ndcg_at_k(relevant_ranks, len(task.all_relevant), _TOP_K) + ndcg10_sum += q_ndcg10 + + if verbose: + print( + f" ndcg@10={q_ndcg10:.3f} ranks={relevant_ranks} n_rel={len(task.all_relevant)} q={task.query!r}", + file=sys.stderr, + ) + print(f" targets: {', '.join(t.path for t in task.all_relevant)}", file=sys.stderr) + print(f" top-5: {[Path(fp).name for fp in file_paths[:5]]}", file=sys.stderr) + + latencies.sort() + return ndcg10_sum / len(tasks), latencies[len(latencies) // 2] + + +def _parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser(description="Benchmark probe on the semble benchmark suite.") + parser.add_argument("--repo", action="append", default=[], help="Limit to one or more repo names.") + parser.add_argument("--language", action="append", default=[], help="Limit to one or more languages.") + parser.add_argument("--verbose", action="store_true", help="Print per-query results.") + return parser.parse_args() + + +def main() -> None: + """Run the probe baseline benchmark.""" + args = _parse_args() + repo_specs = available_repo_specs() + tasks = apply_task_filters( + load_tasks(repo_specs=repo_specs), repos=args.repo or None, languages=args.language or None + ) + + print("probe (bm25, tree-sitter)", file=sys.stderr) + print("NOTE: probe uses keyword ranking; natural-language queries disadvantage it.", file=sys.stderr) + print(f"{'Repo':<22} {'Language':<12} {'NDCG@10':>8} {'p50':>8}", file=sys.stderr) + print(f"{'-' * 22} {'-' * 12} {'-' * 8} {'-' * 8}", file=sys.stderr) + + results: list[RepoResult] = [] + for repo, repo_task_list in sorted(grouped_tasks(tasks).items()): + spec = repo_specs[repo] + if args.verbose: + print(f"\n--- {repo} ---", file=sys.stderr) + ndcg10, p50_ms = _evaluate_repo(repo_task_list, spec.benchmark_dir, verbose=args.verbose) + 
results.append(RepoResult(repo=repo, language=spec.language, ndcg10=ndcg10, p50_ms=p50_ms)) + print(f"{repo:<22} {spec.language:<12} {ndcg10:>8.3f} {p50_ms:>7.1f}ms", file=sys.stderr) + + if not results: + return + + avg_ndcg10 = sum(r.ndcg10 for r in results) / len(results) + avg_p50 = sum(r.p50_ms for r in results) / len(results) + print(f"{'-' * 22} {'-' * 12} {'-' * 8} {'-' * 8}", file=sys.stderr) + avg_label = f"Average ({len(results)})" + print( + f"{avg_label:<22} {'':<12} {avg_ndcg10:>8.3f} {avg_p50:>7.1f}ms", + file=sys.stderr, + ) + + summary = { + "tool": "probe", + "note": "BM25 + tree-sitter; no embedding model, no persistent index; natural-language queries disadvantage it", + "repos": [ + {"repo": r.repo, "language": r.language, "ndcg10": round(r.ndcg10, 4), "p50_ms": round(r.p50_ms, 1)} + for r in results + ], + "avg_ndcg10": round(avg_ndcg10, 4), + "avg_p50_ms": round(avg_p50, 1), + } + save_results("probe", summary) + print(json.dumps(summary, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/benchmarks/plot.py b/benchmarks/plot.py index db37f55..517f627 100644 --- a/benchmarks/plot.py +++ b/benchmarks/plot.py @@ -24,11 +24,19 @@ class _Method(TypedDict): { "name": "ripgrep", "ndcg10": 0.126, - "index_ms": 0.0, + "index_ms": 0.0, # no persistent index; scans on the fly "query_p50_ms": 12.08, "color": "#606060", "params_m": 0, }, + { + "name": "probe", + "ndcg10": 0.387, + "index_ms": 0.0, # no persistent index; scans on the fly + "query_p50_ms": 207.1, + "color": "#9b7bb0", + "params_m": 0, + }, { "name": "BM25", "ndcg10": 0.673, @@ -45,6 +53,14 @@ class _Method(TypedDict): "color": "#e8a838", "params_m": 16, }, + { + "name": "grepai", + "ndcg10": 0.561, + "index_ms": 34955.0, + "query_p50_ms": 47.7, + "color": "#c0724a", + "params_m": 137, + }, { "name": "CodeRankEmbed", "ndcg10": 0.7648, @@ -170,7 +186,17 @@ def _make_plot(out_path: Path, *, warm: bool = False) -> None: ) x_label = (x ** (1 / 3) + cbrt_label_delta) ** 3 - ax.text(x_label, y, m["name"], fontsize=8.5, color=m["color"], ha="left", va="center", zorder=4) + ax.text( + x_label, + y, + m["name"], + fontsize=8.5, + fontweight="bold" if m["name"] == "semble" else "normal", + color=m["color"], + ha="left", + va="center", + zorder=4, + ) ax.set_xscale("function", functions=(_cbrt_forward, _cbrt_inverse)) ax.set_ylabel("NDCG@10", fontsize=10, color="#444444") diff --git a/benchmarks/results/grepai-715563a812c3.json b/benchmarks/results/grepai-715563a812c3.json new file mode 100644 index 0000000..b65f4a7 --- /dev/null +++ b/benchmarks/results/grepai-715563a812c3.json @@ -0,0 +1,449 @@ +{ + "tool": "grepai", + "repos": [ + { + "repo": "abseil-cpp", + "language": "cpp", + "ndcg10": 0.5955, + "p50_ms": 147.9, + "index_ms": 226627.0 + }, + { + "repo": "aeson", + "language": "haskell", + "ndcg10": 0.6627, + "p50_ms": 30.2, + "index_ms": 10019.0 + }, + { + "repo": "aiohttp", + "language": "python", + "ndcg10": 0.6469, + "p50_ms": 35.9, + "index_ms": 15180.0 + }, + { + "repo": "alamofire", + "language": "swift", + "ndcg10": 0.5664, + "p50_ms": 35.9, + "index_ms": 11261.0 + }, + { + "repo": "axios", + "language": "javascript", + "ndcg10": 0.4167, + "p50_ms": 26.4, + "index_ms": 3941.0 + }, + { + "repo": "axum", + "language": "rust", + "ndcg10": 0.5127, + "p50_ms": 34.1, + "index_ms": 14836.0 + }, + { + "repo": "bash-it", + "language": "bash", + "ndcg10": 0.7448, + "p50_ms": 47.0, + "index_ms": 21867.0 + }, + { + "repo": "bats-core", + "language": "bash", + "ndcg10": 0.425, + "p50_ms": 24.2, + "index_ms": 
1518.0 + }, + { + "repo": "cats", + "language": "scala", + "ndcg10": 0.2283, + "p50_ms": 48.4, + "index_ms": 31562.0 + }, + { + "repo": "chi", + "language": "go", + "ndcg10": 0.7807, + "p50_ms": 33.3, + "index_ms": 8811.0 + }, + { + "repo": "circe", + "language": "scala", + "ndcg10": 0.4538, + "p50_ms": 26.7, + "index_ms": 4250.0 + }, + { + "repo": "click", + "language": "python", + "ndcg10": 0.9217, + "p50_ms": 28.6, + "index_ms": 6980.0 + }, + { + "repo": "cobra", + "language": "go", + "ndcg10": 0.7778, + "p50_ms": 32.7, + "index_ms": 12440.0 + }, + { + "repo": "commons-lang", + "language": "java", + "ndcg10": 0.4406, + "p50_ms": 74.7, + "index_ms": 64942.0 + }, + { + "repo": "curl", + "language": "c", + "ndcg10": 0.5632, + "p50_ms": 91.8, + "index_ms": 114986.0 + }, + { + "repo": "dapper", + "language": "csharp", + "ndcg10": 0.3767, + "p50_ms": 32.1, + "index_ms": 8783.0 + }, + { + "repo": "ecto", + "language": "elixir", + "ndcg10": 0.7364, + "p50_ms": 39.2, + "index_ms": 22146.0 + }, + { + "repo": "exposed", + "language": "kotlin", + "ndcg10": 0.4851, + "p50_ms": 38.1, + "index_ms": 15134.0 + }, + { + "repo": "express", + "language": "javascript", + "ndcg10": 0.9622, + "p50_ms": 23.5, + "index_ms": 1519.0 + }, + { + "repo": "fastapi", + "language": "python", + "ndcg10": 0.4914, + "p50_ms": 33.7, + "index_ms": 11810.0 + }, + { + "repo": "flask", + "language": "python", + "ndcg10": 0.6361, + "p50_ms": 30.0, + "index_ms": 6681.0 + }, + { + "repo": "fmtlib", + "language": "cpp", + "ndcg10": 0.8105, + "p50_ms": 30.5, + "index_ms": 14265.0 + }, + { + "repo": "gin", + "language": "go", + "ndcg10": 0.607, + "p50_ms": 38.0, + "index_ms": 20015.0 + }, + { + "repo": "gson", + "language": "java", + "ndcg10": 0.5272, + "p50_ms": 52.1, + "index_ms": 30942.0 + }, + { + "repo": "guzzle", + "language": "php", + "ndcg10": 0.5859, + "p50_ms": 27.7, + "index_ms": 4546.0 + }, + { + "repo": "http4s", + "language": "scala", + "ndcg10": 0.3079, + "p50_ms": 46.0, + "index_ms": 24269.0 + }, + { + "repo": "httpx", + "language": "python", + "ndcg10": 0.6149, + "p50_ms": 27.1, + "index_ms": 5459.0 + }, + { + "repo": "jackson-databind", + "language": "java", + "ndcg10": 0.1903, + "p50_ms": 98.4, + "index_ms": 92140.0 + }, + { + "repo": "kotlinx-coroutines", + "language": "kotlin", + "ndcg10": 0.4977, + "p50_ms": 43.8, + "index_ms": 19105.0 + }, + { + "repo": "ktor", + "language": "kotlin", + "ndcg10": 0.4514, + "p50_ms": 31.5, + "index_ms": 9397.0 + }, + { + "repo": "laravel-framework", + "language": "php", + "ndcg10": 0.297, + "p50_ms": 134.1, + "index_ms": 147392.0 + }, + { + "repo": "lazy.nvim", + "language": "lua", + "ndcg10": 0.5555, + "p50_ms": 32.8, + "index_ms": 12146.0 + }, + { + "repo": "libuv", + "language": "c", + "ndcg10": 0.502, + "p50_ms": 49.3, + "index_ms": 38233.0 + }, + { + "repo": "messagepack-csharp", + "language": "csharp", + "ndcg10": 0.3051, + "p50_ms": 45.2, + "index_ms": 23631.0 + }, + { + "repo": "mini.nvim", + "language": "lua", + "ndcg10": 1.0, + "p50_ms": 61.9, + "index_ms": 65813.0 + }, + { + "repo": "model2vec", + "language": "python", + "ndcg10": 0.4896, + "p50_ms": 28.7, + "index_ms": 6061.0 + }, + { + "repo": "monolog", + "language": "php", + "ndcg10": 0.3217, + "p50_ms": 33.9, + "index_ms": 12425.0 + }, + { + "repo": "newtonsoft-json", + "language": "csharp", + "ndcg10": 0.1495, + "p50_ms": 63.5, + "index_ms": 44247.0 + }, + { + "repo": "nlohmann-json", + "language": "cpp", + "ndcg10": 0.7863, + "p50_ms": 41.2, + "index_ms": 23650.0 + }, + { + "repo": "nvm", + "language": "bash", 
+ "ndcg10": 1.0, + "p50_ms": 35.4, + "index_ms": 13357.0 + }, + { + "repo": "pandoc", + "language": "haskell", + "ndcg10": 0.1382, + "p50_ms": 72.9, + "index_ms": 66106.0 + }, + { + "repo": "phoenix", + "language": "elixir", + "ndcg10": 0.6589, + "p50_ms": 34.7, + "index_ms": 18193.0 + }, + { + "repo": "plug", + "language": "elixir", + "ndcg10": 0.6127, + "p50_ms": 34.5, + "index_ms": 10328.0 + }, + { + "repo": "pydantic", + "language": "python", + "ndcg10": 0.4918, + "p50_ms": 51.7, + "index_ms": 36380.0 + }, + { + "repo": "rack", + "language": "ruby", + "ndcg10": 0.5663, + "p50_ms": 30.4, + "index_ms": 9106.0 + }, + { + "repo": "rails", + "language": "ruby", + "ndcg10": 0.5675, + "p50_ms": 38.4, + "index_ms": 15155.0 + }, + { + "repo": "redis", + "language": "c", + "ndcg10": 0.5988, + "p50_ms": 124.7, + "index_ms": 167157.0 + }, + { + "repo": "redux", + "language": "javascript", + "ndcg10": 0.645, + "p50_ms": 22.8, + "index_ms": 4560.0 + }, + { + "repo": "requests", + "language": "python", + "ndcg10": 0.7508, + "p50_ms": 27.8, + "index_ms": 6695.0 + }, + { + "repo": "serde", + "language": "rust", + "ndcg10": 0.5056, + "p50_ms": 49.4, + "index_ms": 30975.0 + }, + { + "repo": "sinatra", + "language": "ruby", + "ndcg10": 0.7964, + "p50_ms": 26.1, + "index_ms": 4863.0 + }, + { + "repo": "snapkit", + "language": "swift", + "ndcg10": 0.4189, + "p50_ms": 26.0, + "index_ms": 5456.0 + }, + { + "repo": "starlette", + "language": "python", + "ndcg10": 0.6606, + "p50_ms": 30.5, + "index_ms": 7899.0 + }, + { + "repo": "telescope.nvim", + "language": "lua", + "ndcg10": 0.5419, + "p50_ms": 37.3, + "index_ms": 16376.0 + }, + { + "repo": "tokio", + "language": "rust", + "ndcg10": 0.5391, + "p50_ms": 69.9, + "index_ms": 76184.0 + }, + { + "repo": "trpc", + "language": "typescript", + "ndcg10": 0.4809, + "p50_ms": 34.8, + "index_ms": 10920.0 + }, + { + "repo": "vapor", + "language": "swift", + "ndcg10": 0.3022, + "p50_ms": 43.7, + "index_ms": 19099.0 + }, + { + "repo": "vitest", + "language": "typescript", + "ndcg10": 0.4334, + "p50_ms": 46.1, + "index_ms": 26371.0 + }, + { + "repo": "xmonad", + "language": "haskell", + "ndcg10": 0.6487, + "p50_ms": 32.2, + "index_ms": 6080.0 + }, + { + "repo": "zig", + "language": "zig", + "ndcg10": 0.7124, + "p50_ms": 199.4, + "index_ms": 350606.0 + }, + { + "repo": "zig-clap", + "language": "zig", + "ndcg10": 0.8083, + "p50_ms": 33.5, + "index_ms": 6076.0 + }, + { + "repo": "zls", + "language": "zig", + "ndcg10": 0.7444, + "p50_ms": 47.3, + "index_ms": 32747.0 + }, + { + "repo": "zod", + "language": "typescript", + "ndcg10": 0.2684, + "p50_ms": 56.6, + "index_ms": 52460.0 + } + ], + "avg_ndcg10": 0.5606, + "avg_p50_ms": 47.7, + "avg_index_ms": 34955.0 +} diff --git a/benchmarks/results/probe-715563a812c3.json b/benchmarks/results/probe-715563a812c3.json new file mode 100644 index 0000000..f9b725b --- /dev/null +++ b/benchmarks/results/probe-715563a812c3.json @@ -0,0 +1,385 @@ +{ + "tool": "probe-bm25", + "repos": [ + { + "repo": "abseil-cpp", + "language": "cpp", + "ndcg10": 0.1244, + "p50_ms": 111.9 + }, + { + "repo": "aeson", + "language": "haskell", + "ndcg10": 0.2037, + "p50_ms": 139.1 + }, + { + "repo": "aiohttp", + "language": "python", + "ndcg10": 0.4316, + "p50_ms": 140.9 + }, + { + "repo": "alamofire", + "language": "swift", + "ndcg10": 0.3013, + "p50_ms": 141.0 + }, + { + "repo": "axios", + "language": "javascript", + "ndcg10": 0.4722, + "p50_ms": 90.2 + }, + { + "repo": "axum", + "language": "rust", + "ndcg10": 0.28, + "p50_ms": 793.9 + }, + { + "repo": 
"bash-it", + "language": "bash", + "ndcg10": 0.1601, + "p50_ms": 86.4 + }, + { + "repo": "bats-core", + "language": "bash", + "ndcg10": 0.4114, + "p50_ms": 69.1 + }, + { + "repo": "cats", + "language": "scala", + "ndcg10": 0.3496, + "p50_ms": 101.0 + }, + { + "repo": "chi", + "language": "go", + "ndcg10": 0.28, + "p50_ms": 104.8 + }, + { + "repo": "circe", + "language": "scala", + "ndcg10": 0.4489, + "p50_ms": 101.6 + }, + { + "repo": "click", + "language": "python", + "ndcg10": 0.6472, + "p50_ms": 126.3 + }, + { + "repo": "cobra", + "language": "go", + "ndcg10": 0.532, + "p50_ms": 151.7 + }, + { + "repo": "commons-lang", + "language": "java", + "ndcg10": 0.5891, + "p50_ms": 263.7 + }, + { + "repo": "curl", + "language": "c", + "ndcg10": 0.2494, + "p50_ms": 412.0 + }, + { + "repo": "dapper", + "language": "csharp", + "ndcg10": 0.4014, + "p50_ms": 235.2 + }, + { + "repo": "ecto", + "language": "elixir", + "ndcg10": 0.3956, + "p50_ms": 124.9 + }, + { + "repo": "exposed", + "language": "kotlin", + "ndcg10": 0.3478, + "p50_ms": 113.6 + }, + { + "repo": "express", + "language": "javascript", + "ndcg10": 0.7438, + "p50_ms": 72.5 + }, + { + "repo": "fastapi", + "language": "python", + "ndcg10": 0.4201, + "p50_ms": 152.3 + }, + { + "repo": "flask", + "language": "python", + "ndcg10": 0.5163, + "p50_ms": 97.9 + }, + { + "repo": "fmtlib", + "language": "cpp", + "ndcg10": 0.4674, + "p50_ms": 369.8 + }, + { + "repo": "gin", + "language": "go", + "ndcg10": 0.4167, + "p50_ms": 123.8 + }, + { + "repo": "gson", + "language": "java", + "ndcg10": 0.4908, + "p50_ms": 127.5 + }, + { + "repo": "guzzle", + "language": "php", + "ndcg10": 0.397, + "p50_ms": 110.2 + }, + { + "repo": "http4s", + "language": "scala", + "ndcg10": 0.3769, + "p50_ms": 94.8 + }, + { + "repo": "httpx", + "language": "python", + "ndcg10": 0.5374, + "p50_ms": 95.7 + }, + { + "repo": "jackson-databind", + "language": "java", + "ndcg10": 0.5278, + "p50_ms": 217.0 + }, + { + "repo": "kotlinx-coroutines", + "language": "kotlin", + "ndcg10": 0.3092, + "p50_ms": 111.2 + }, + { + "repo": "ktor", + "language": "kotlin", + "ndcg10": 0.3482, + "p50_ms": 107.4 + }, + { + "repo": "laravel-framework", + "language": "php", + "ndcg10": 0.306, + "p50_ms": 188.6 + }, + { + "repo": "lazy.nvim", + "language": "lua", + "ndcg10": 0.3382, + "p50_ms": 86.8 + }, + { + "repo": "libuv", + "language": "c", + "ndcg10": 0.5078, + "p50_ms": 213.2 + }, + { + "repo": "messagepack-csharp", + "language": "csharp", + "ndcg10": 0.3902, + "p50_ms": 258.3 + }, + { + "repo": "mini.nvim", + "language": "lua", + "ndcg10": 0.4623, + "p50_ms": 269.9 + }, + { + "repo": "model2vec", + "language": "python", + "ndcg10": 0.4623, + "p50_ms": 88.1 + }, + { + "repo": "monolog", + "language": "php", + "ndcg10": 0.3177, + "p50_ms": 103.2 + }, + { + "repo": "newtonsoft-json", + "language": "csharp", + "ndcg10": 0.3832, + "p50_ms": 360.5 + }, + { + "repo": "nlohmann-json", + "language": "cpp", + "ndcg10": 0.5336, + "p50_ms": 391.6 + }, + { + "repo": "nvm", + "language": "bash", + "ndcg10": 0.1067, + "p50_ms": 137.5 + }, + { + "repo": "pandoc", + "language": "haskell", + "ndcg10": 0.2581, + "p50_ms": 162.0 + }, + { + "repo": "phoenix", + "language": "elixir", + "ndcg10": 0.3504, + "p50_ms": 97.5 + }, + { + "repo": "plug", + "language": "elixir", + "ndcg10": 0.4895, + "p50_ms": 89.5 + }, + { + "repo": "pydantic", + "language": "python", + "ndcg10": 0.3377, + "p50_ms": 263.7 + }, + { + "repo": "rack", + "language": "ruby", + "ndcg10": 0.3986, + "p50_ms": 135.1 + }, + { + "repo": "rails", + 
"language": "ruby", + "ndcg10": 0.2155, + "p50_ms": 256.4 + }, + { + "repo": "redis", + "language": "c", + "ndcg10": 0.3943, + "p50_ms": 1040.8 + }, + { + "repo": "redux", + "language": "javascript", + "ndcg10": 0.5492, + "p50_ms": 73.3 + }, + { + "repo": "requests", + "language": "python", + "ndcg10": 0.5147, + "p50_ms": 83.0 + }, + { + "repo": "serde", + "language": "rust", + "ndcg10": 0.1772, + "p50_ms": 873.0 + }, + { + "repo": "sinatra", + "language": "ruby", + "ndcg10": 0.5327, + "p50_ms": 68.4 + }, + { + "repo": "snapkit", + "language": "swift", + "ndcg10": 0.325, + "p50_ms": 67.0 + }, + { + "repo": "starlette", + "language": "python", + "ndcg10": 0.5292, + "p50_ms": 86.6 + }, + { + "repo": "telescope.nvim", + "language": "lua", + "ndcg10": 0.2082, + "p50_ms": 113.4 + }, + { + "repo": "tokio", + "language": "rust", + "ndcg10": 0.2686, + "p50_ms": 1235.1 + }, + { + "repo": "trpc", + "language": "typescript", + "ndcg10": 0.342, + "p50_ms": 115.1 + }, + { + "repo": "vapor", + "language": "swift", + "ndcg10": 0.2145, + "p50_ms": 85.4 + }, + { + "repo": "vitest", + "language": "typescript", + "ndcg10": 0.373, + "p50_ms": 168.5 + }, + { + "repo": "xmonad", + "language": "haskell", + "ndcg10": 0.4766, + "p50_ms": 90.0 + }, + { + "repo": "zig", + "language": "zig", + "ndcg10": 0.2973, + "p50_ms": 254.9 + }, + { + "repo": "zig-clap", + "language": "zig", + "ndcg10": 0.4813, + "p50_ms": 83.2 + }, + { + "repo": "zls", + "language": "zig", + "ndcg10": 0.3273, + "p50_ms": 125.0 + }, + { + "repo": "zod", + "language": "typescript", + "ndcg10": 0.3473, + "p50_ms": 396.7 + } + ], + "avg_ndcg10": 0.3872, + "avg_p50_ms": 207.1 +}