2 changes: 2 additions & 0 deletions README.md
@@ -149,6 +149,8 @@ We benchmark quality and speed across all methods on ~1,250 queries over 63 repo
| CodeRankEmbed | 0.765 | 57 s | 16 ms |
| ColGREP | 0.693 | 5.8 s | 124 ms |
| BM25 | 0.673 | 263 ms | 0.02 ms |
| grepai | 0.561 | 35 s | 48 ms |
| probe | 0.387 | — | 207 ms |
| ripgrep | 0.126 | — | 12 ms |

Semble achieves 99% of the performance of the 137M-parameter [CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) Hybrid, while indexing 218x faster and answering queries 11x faster. See [benchmarks](benchmarks/README.md) for per-language results, ablations, and methodology.
Binary file modified assets/images/speed_vs_ndcg_cold.png
Binary file modified assets/images/speed_vs_ndcg_warm.png
92 changes: 70 additions & 22 deletions benchmarks/README.md
@@ -7,6 +7,7 @@ Quality and speed benchmarks for `semble`.
- [Ablations](#ablations)
- [Dataset](#dataset)
- [Methods](#methods)
- [Excluded methods](#excluded-methods)
- [Running the benchmarks](#running-the-benchmarks)

## Main results
@@ -20,6 +21,8 @@ Quality and speed across all methods.
| CodeRankEmbed | 0.765 | 57 s | 16 ms |
| ColGREP | 0.693 | 5.8 s | 124 ms |
| BM25 | 0.673 | 263 ms | 0.02 ms |
| grepai | 0.561 | 35 s | 48 ms |
| probe | 0.387 | — | 207 ms |
| ripgrep | 0.126 | — | 12 ms |

| ![Speed vs quality (cold)](../assets/images/speed_vs_ndcg_cold.png) | ![Speed vs quality (warm)](../assets/images/speed_vs_ndcg_warm.png) |
@@ -34,28 +37,28 @@ NDCG@10 is averaged across all queries. Speed numbers use one repo per language,

NDCG@10 per language, sorted by CodeRankEmbed Hybrid (CRE in the table). Best score per row is bolded.

| Language | semble | CRE Hybrid | CRE | ColGREP | ripgrep |
|---|---:|---:|---:|---:|---:|
| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.180 |
| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.126 |
| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.230 |
| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.134 |
| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.176 |
| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.000 |
| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.117 |
| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.133 |
| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.202 |
| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.123 |
| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.160 |
| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.000 |
| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.000 |
| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.198 |
| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.166 |
| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.162 |
| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.000 |
| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.000 |
| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.128 |
| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.126** |
| Language | semble | CRE Hybrid | CRE | ColGREP | grepai | probe | ripgrep |
|---|---:|---:|---:|---:|---:|---:|---:|
| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.330 | 0.392 | 0.180 |
| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.731 | 0.375 | 0.126 |
| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.643 | 0.382 | 0.230 |
| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.669 | 0.412 | 0.134 |
| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.675 | 0.588 | 0.176 |
| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.755 | 0.369 | 0.000 |
| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.277 | 0.392 | 0.117 |
| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.722 | 0.410 | 0.133 |
| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.634 | 0.488 | 0.202 |
| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.402 | 0.340 | 0.123 |
| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.429 | 0.280 | 0.160 |
| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.723 | 0.226 | 0.000 |
| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.699 | 0.336 | 0.000 |
| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.386 | 0.536 | 0.198 |
| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.478 | 0.335 | 0.166 |
| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.519 | 0.242 | 0.162 |
| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.555 | 0.384 | 0.000 |
| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.483 | 0.313 | 0.000 |
| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.394 | 0.354 | 0.128 |
| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.561** | **0.387** | **0.126** |
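
For reference, NDCG@10 itself is straightforward to compute. The sketch below is illustrative only, not the harness's actual code, and assumes each query carries a mapping from gold file paths to relevance grades:

```python
import math

def ndcg_at_k(ranked: list[str], relevant: dict[str, float], k: int = 10) -> float:
    # DCG: each hit's relevance grade, discounted by log2 of (1-based rank + 1).
    dcg = sum(
        relevant.get(doc, 0.0) / math.log2(rank + 2)
        for rank, doc in enumerate(ranked[:k])
    )
    # IDCG: the best achievable DCG, i.e. all relevant docs ranked first.
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one gold file, retrieved at rank 2.
print(ndcg_at_k(["a.py", "b.py"], {"b.py": 1.0}))  # log2(2)/log2(3) ≈ 0.631
```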

## Ablations

@@ -102,10 +105,19 @@ NDCG@10 per language, sorted by CodeRankEmbed Hybrid (CRE in the table). Best sc
## Methods

- **[ripgrep](https://github.com/BurntSushi/ripgrep)**: fast regex search over files, included as a raw keyword-match baseline.
- **[probe](https://github.com/buger/probe)**: BM25 keyword ranking backed by tree-sitter parse trees. No persistent index; scans on the fly.
- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)**: late-interaction code retrieval built on next-plaid with the [LateOn-Code-edge](https://huggingface.co/lightonai/LateOn-Code-edge) model.
- **[grepai](https://github.com/nicholasgasior/grepai)**: semantic search using [nomic-embed-text](https://huggingface.co/nomic-ai/nomic-embed-text-v1) (137M params) via a local Ollama daemon.
- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)**: 137M-param transformer embedding model for code retrieval. *CodeRankEmbed Hybrid* fuses its dense scores with BM25 (see the fusion sketch after this list).
- **[semble](https://github.com/your-repo/semble)**: this library. [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
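
As a rough illustration of the hybrid fusion mentioned above: min-max normalize each method's scores, then take a weighted sum. This is a sketch of one common choice, not necessarily the fusion CodeRankEmbed Hybrid uses in this benchmark, and `alpha` is an assumed parameter:

```python
def fuse(dense: dict[str, float], bm25: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    """Weighted sum of min-max-normalized dense and BM25 scores."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # all-equal scores would otherwise divide by zero
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, b = normalize(dense), normalize(bm25)
    # Docs scored by only one method get 0 from the other.
    return {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in d.keys() | b.keys()
    }
```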

## Excluded methods

Two tools were considered but not included in the benchmark:

- **[codanna](https://codanna.io)**: symbol-level semantic search with fastembed. Excluded because it does not support Haskell, Bash, Zig, Scala, Elixir, or Ruby: 6 of the 19 benchmark languages, covering 20 of the 63 repos (~38% of tasks).
- **[claude-context](https://github.com/zilliztech/claude-context)**: retrieval-augmented code search using OpenAI embeddings and a vector database. Excluded because it requires a paid OpenAI API key and a running vector-DB service.

## Running the benchmarks

Repos are pinned in `repos.json` and cloned into `~/.cache/semble-bench`:
@@ -152,6 +164,42 @@ uv run python -m benchmarks.baselines.ablations --mode semble-semantic

</details>

<details>
<summary>probe</summary>

Needs `probe` on `$PATH` (`npm install -g @buger/probe`).

```bash
uv run python -m benchmarks.baselines.probe
uv run python -m benchmarks.baselines.probe --repo fastapi --repo axios
```

</details>

<details>
<summary>grepai</summary>

Needs `grepai` on `$PATH` and Ollama running with `nomic-embed-text` pulled:

```bash
ollama pull nomic-embed-text
```

```bash
uv run python -m benchmarks.baselines.grepai
uv run python -m benchmarks.baselines.grepai --repo fastapi --repo axios
```

Large repos can take several minutes to index, so raise `--timeout <seconds>` (default 120) for repos with many files:

```bash
uv run python -m benchmarks.baselines.grepai --timeout 1800 --output results.json
```

The `--output` flag enables resume mode: already-completed repos are skipped on restart.
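
A minimal sketch of that resume pattern, assuming the output file is a JSON object keyed by repo name (`benchmark_repo` is a hypothetical stand-in for the per-repo runner):

```python
import json
from pathlib import Path

def run_all(repos: list[str], output: Path) -> None:
    # Resume: an existing results file seeds the run; absent file means fresh start.
    results = json.loads(output.read_text()) if output.exists() else {}
    for repo in repos:
        if repo in results:
            continue  # completed on a previous run, skip
        results[repo] = benchmark_repo(repo)  # hypothetical per-repo runner
        # Checkpoint after every repo so an interrupt loses at most one result.
        output.write_text(json.dumps(results, indent=2))
```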

</details>

<details>
<summary>ripgrep</summary>
