Merged
68 changes: 68 additions & 0 deletions README.md
@@ -19,6 +19,7 @@
[Quickstart](#quickstart) •
[Main Features](#main-features) •
[MCP Server](#mcp-server) •
[CLI](#cli) •
[How it works](#how-it-works) •
[Benchmarks](#benchmarks)

@@ -119,6 +120,73 @@ Add to `~/.cursor/mcp.json` (or `.cursor/mcp.json` in your project):
| `search` | Search a codebase with a natural-language or code query. Pass `repo` as a git URL or local path. |
| `find_related` | Given a file path and line number, return chunks semantically similar to the code at that location. |

### Sub-agent support

Claude Code and Codex CLI lazy-load MCP tool schemas, so sub-agents cannot call `mcp__semble__search` directly. As a workaround, have sub-agents invoke semble through the [CLI](#cli) via Bash instead.

**Claude Code**: run this once in your project root:

```bash
semble init
# or, if semble is not on $PATH:
uvx --from "semble[mcp]" semble init
```

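This writes [`.claude/agents/semble-search.md`](src/semble/agents/semble-search.md) into the current project. If the file already exists, `semble init` exits with an error unless you pass `--force`.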

**Other tools (Codex, etc.)**: append the following to your `AGENTS.md`:

```markdown
## Code Search

Use `semble search` to find code by describing what it does or naming a symbol/identifier, instead of grep:

​```bash
semble search "authentication flow" ./my-project
semble search "save_pretrained" ./my-project
semble search "save model to disk" ./my-project --top-k 10
​```

Use `semble find-related` to discover code similar to a known location (pass `file_path` and `line` from a prior search result):

​```bash
semble find-related src/auth.py 42 ./my-project
​```

`path` defaults to the current directory when omitted; git URLs are accepted.

If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

## Workflow

1. Start with `semble search` to find relevant chunks.
2. Inspect full files only when the returned chunk is not enough context.
3. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
4. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
```

## CLI

Semble also ships as a standalone CLI for use outside of MCP. This is useful in scripts, sub-agents, or anywhere you want search results without an MCP session.

```bash
# Search a local repo
semble search "authentication flow" ./my-project

# Search for a symbol or identifier
semble search "save_pretrained" ./my-project

# Search a remote repo (cloned on demand)
semble search "save model to disk" https://github.com/MinishLab/model2vec

# Find code similar to a known location (file_path and line from a prior search result)
semble find-related src/auth.py 42 ./my-project
```

`path` defaults to the current directory when omitted; git URLs are accepted.

If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.
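
As a rough illustration of how a `path` argument can be told apart from a git URL, the sketch below mirrors the scheme-prefix and scp-style checks added in `src/semble/utils.py` by this change. The function name `looks_like_git_url` is hypothetical and exists only for this example; the real CLI may treat edge cases differently.

```python
import re

# Prefixes and scp-like pattern copied from src/semble/utils.py in this change.
GIT_URL_SCHEMES = ("https://", "http://", "ssh://", "git://", "git+ssh://", "file://")
SCP_GIT_URL_RE = re.compile(r"^[\w.-]+@[\w.-]+:(?!/)")


def looks_like_git_url(path: str) -> bool:
    """Hypothetical helper: True if `path` looks like a git URL, not a local path."""
    return path.startswith(GIT_URL_SCHEMES) or SCP_GIT_URL_RE.match(path) is not None


if __name__ == "__main__":
    for candidate in (
        "https://github.com/MinishLab/model2vec",
        "git@github.com:MinishLab/semble.git",
        "./my-project",
        ".",
    ):
        kind = "git URL" if looks_like_git_url(candidate) else "local path"
        print(f"{candidate!r}: {kind}")
```

Anything that fails both checks is treated as a local path, which is why `.` and `./my-project` are indexed in place rather than cloned.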

## How it works

Semble splits each file into code-aware chunks using [Chonkie](https://github.com/chonkie-inc/chonkie), then scores every query against the chunks with two complementary retrievers: static [Model2Vec](https://github.com/MinishLab/model2vec) embeddings using the code-specialized [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) model for semantic similarity, and [BM25](https://github.com/xhluca/bm25s) for lexical matches on identifiers and API names. The two score lists are fused with Reciprocal Rank Fusion (RRF).
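
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It is not Semble's implementation: the function name, the example chunk identifiers, and the constant `k = 60` (a value commonly used for RRF) are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in.

    `k` is an assumed smoothing constant, not a value taken from Semble.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by item name for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical rankings from a semantic retriever and a lexical (BM25) retriever.
semantic = ["auth.py", "session.py", "utils.py"]
lexical = ["auth.py", "tokens.py", "session.py"]

fused = reciprocal_rank_fusion([semantic, lexical])

if __name__ == "__main__":
    for item, score in fused:
        print(f"{item}: {score:.4f}")
```

Because RRF only looks at ranks, the embedding similarities and BM25 scores never need to be normalized onto a common scale before they are combined.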
5 changes: 3 additions & 2 deletions pyproject.toml
@@ -61,7 +61,7 @@ dev = [
"Source" = "https://github.com/MinishLab/semble"

[project.scripts]
-semble = "semble.mcp:main"
+semble = "semble.cli:main"

[tool.setuptools]
package-dir = {"" = "src"}
@@ -72,7 +72,7 @@ where = ["src"]
include = ["semble*"]

[tool.setuptools.package-data]
-semble = ["py.typed"]
+semble = ["py.typed", "agents/*.md"]

[tool.setuptools_scm]
# can be empty if no extra settings are needed, presence enables setuptools_scm
@@ -88,6 +88,7 @@ target-version = "py310"
[tool.ruff.lint.per-file-ignores]
"tests/**" = ["ANN"]
"benchmarks/*.py" = ["T20"]
"src/semble/cli.py" = ["T20", "E501"]

[tool.ruff.lint]
select = [
30 changes: 30 additions & 0 deletions src/semble/agents/semble-search.md
@@ -0,0 +1,30 @@
---
name: semble-search
description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over Grep/Glob/Read for any semantic or exploratory question.
tools: Bash, Read
---

Use `semble search` to find code by describing what it does or naming a symbol/identifier, instead of grep:

```bash
semble search "authentication flow" ./my-project
semble search "save_pretrained" ./my-project
semble search "save model to disk" ./my-project --top-k 10
```

Use `semble find-related` to discover code similar to a known location (pass `file_path` and `line` from a prior search result):

```bash
semble find-related src/auth.py 42 ./my-project
```

`path` defaults to the current directory when omitted; git URLs are accepted.

If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

## Workflow

1. Start with `semble search` to find relevant chunks.
2. Inspect full files only when the returned chunk is not enough context.
3. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
4. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
97 changes: 97 additions & 0 deletions src/semble/cli.py
@@ -0,0 +1,97 @@
import argparse
import asyncio
import sys
from importlib.resources import files
from pathlib import Path

from semble.index import SembleIndex
from semble.utils import _format_results, _is_git_url, _resolve_chunk

_CLAUDE_FILE_PATH = Path(".claude") / "agents" / "semble-search.md"
_CLI_DISPATCH_ARGS = frozenset({"search", "find-related", "init", "-h", "--help"})


def main() -> None:
    """Entry point for the semble command-line tool."""
    if len(sys.argv) > 1 and sys.argv[1] in _CLI_DISPATCH_ARGS:
        _cli_main()
    else:
        _mcp_main()


def _mcp_main() -> None:
    parser = argparse.ArgumentParser(
        prog="semble",
        description="Instant local code search for agents.",
    )
    parser.add_argument(
        "path",
        nargs="?",
        default=None,
        help="Local directory or git URL to pre-index at startup (optional).",
    )
    parser.add_argument("--ref", default=None, help="Branch or tag to check out (git URLs only).")
    args = parser.parse_args()
    from semble.mcp import serve

    asyncio.run(serve(args.path, ref=args.ref))


def _run_init(*, force: bool = False) -> None:
    """Write the Claude Code sub-agent file into the current project."""
    dest = _CLAUDE_FILE_PATH
    if dest.exists() and not force:
        print(f"{dest} already exists. Run with --force to overwrite.", file=sys.stderr)
        sys.exit(1)
    dest.parent.mkdir(parents=True, exist_ok=True)
    content = files("semble").joinpath("agents/semble-search.md").read_text(encoding="utf-8")
    dest.write_text(content, encoding="utf-8")
    print(f"Created {dest}")


def _cli_main() -> None:
    parser = argparse.ArgumentParser(prog="semble")
    sub = parser.add_subparsers(dest="command")

    search_p = sub.add_parser("search", help="Search a codebase.")
    search_p.add_argument("query", help="Natural language or code query.")
    search_p.add_argument("path", nargs="?", default=".", help="Local path or git URL (default: current directory).")
    search_p.add_argument("-k", "--top-k", type=int, default=5, help="Number of results (default: 5).")
    search_p.add_argument(
        "-m", "--mode", default="hybrid", choices=["hybrid", "semantic", "bm25"], help="Search mode (default: hybrid)."
    )

    related_p = sub.add_parser("find-related", help="Find code similar to a specific location.")
    related_p.add_argument("file_path", help="File path as shown in search results.")
    related_p.add_argument("line", type=int, help="Line number (1-indexed).")
    related_p.add_argument("path", nargs="?", default=".", help="Local path or git URL (default: current directory).")
    related_p.add_argument("-k", "--top-k", type=int, default=5, help="Number of results (default: 5).")

    init_p = sub.add_parser("init", help="Write .claude/agents/semble-search.md for Claude Code sub-agent support.")
    init_p.add_argument("--force", action="store_true", help="Overwrite if the file already exists.")

    args = parser.parse_args()

    if args.command == "init":
        _run_init(force=args.force)
        return

    index = SembleIndex.from_git(args.path) if _is_git_url(args.path) else SembleIndex.from_path(args.path)

    if args.command == "search":
        results = index.search(args.query, top_k=args.top_k, mode=args.mode)
        if not results:
            print("No results found.")
        else:
            print(_format_results(f"Search results for: {args.query!r} (mode={args.mode})", results))

    elif args.command == "find-related":
        chunk = _resolve_chunk(index.chunks, args.file_path, args.line)
        if chunk is None:
            print(f"No chunk found at {args.file_path}:{args.line}.", file=sys.stderr)
            sys.exit(1)
        results = index.find_related(chunk, top_k=args.top_k)
        if not results:
            print(f"No related chunks found for {args.file_path}:{args.line}.")
        else:
            print(_format_results(f"Chunks related to {args.file_path}:{args.line}", results))
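
The dispatch rule in `main()` can be checked in isolation. The sketch below reproduces only the routing decision from `cli.py` above; `pick_entry` is a hypothetical helper written for this example and is not part of the package.

```python
# Same first-argument set that cli.py uses to route to the CLI parser.
CLI_DISPATCH_ARGS = frozenset({"search", "find-related", "init", "-h", "--help"})


def pick_entry(argv: list[str]) -> str:
    """Hypothetical helper: return "cli" or "mcp" for a given argv."""
    if len(argv) > 1 and argv[1] in CLI_DISPATCH_ARGS:
        return "cli"
    # No arguments, or a first argument that is not a known subcommand:
    # fall back to the MCP server, which treats it as an optional path.
    return "mcp"


if __name__ == "__main__":
    for argv in (
        ["semble"],
        ["semble", "./my-project"],
        ["semble", "search", "authentication flow"],
        ["semble", "--help"],
    ):
        print(argv[1:], "->", pick_entry(argv))
```

One consequence of routing on the first argument: a local directory literally named `search` or `init` would be read as a subcommand, so pre-indexing it for the MCP server would need a spelling such as `./search`.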
68 changes: 2 additions & 66 deletions src/semble/mcp.py
@@ -1,7 +1,6 @@
from __future__ import annotations

import asyncio
-import re
from pathlib import Path
from typing import Annotated, Literal

@@ -10,18 +9,15 @@

from semble.index import SembleIndex
from semble.index.dense import load_model
-from semble.types import Chunk, Encoder, SearchResult
+from semble.types import Encoder
+from semble.utils import _format_results, _is_git_url, _resolve_chunk

_REPO_DESCRIPTION = (
"Git URL (e.g. https://github.com/org/repo) or local path to index and search. "
"Required when no default index was configured at startup. "
"The index is cached after the first call, so repeat queries are fast."
)

-_GIT_URL_SCHEMES = ("https://", "http://", "ssh://", "git://", "git+ssh://", "file://")
-# scp-like syntax: [user@]host:path, where host has no '/' before the ':'.
-_SCP_GIT_URL_RE = re.compile(r"^[\w.-]+@[\w.-]+:(?!/)")


def create_server(cache: _IndexCache, default_source: str | None = None) -> FastMCP:
    """Build and return a configured FastMCP server backed by the given cache."""
@@ -142,70 +138,10 @@ async def get(self, source: str, ref: str | None = None) -> SembleIndex:
        try:
            return await asyncio.shield(task)
        except asyncio.CancelledError:  # pragma: no cover
            # If this waiter was cancelled but the task is still running, preserve it for
            # other waiters. Only evict if the task itself was cancelled.
            if task.done():
                self._tasks.pop(cache_key, None)
            raise
        except Exception:
            # Build failed: evict so the next caller can retry.
            self._tasks.pop(cache_key, None)
            raise


-def _resolve_chunk(chunks: list[Chunk], file_path: str, line: int) -> Chunk | None:
-    """Return the chunk that contains *line* in *file_path*, or None.
-
-    MCP tool arguments are JSON primitives (strings and ints), so the agent
-    passes file_path + line rather than a Chunk object. This function
-    reconstructs the Chunk at the MCP boundary before calling into the library.
-
-    :param chunks: All indexed chunks to search.
-    :param file_path: File path as stored in the index.
-    :param line: 1-indexed line number to resolve.
-    :return: The best-matching Chunk, or None if not found.
-    """
-    fallback = None
-    for chunk in chunks:
-        if chunk.file_path == file_path and chunk.start_line <= line <= chunk.end_line:
-            if line < chunk.end_line:
-                return chunk
-            if fallback is None:  # line == end_line: boundary; keep as fallback for end-of-file chunks
-                fallback = chunk
-    return fallback
-
-
-def _is_git_url(path: str) -> bool:
-    """Return True if path looks like a remote git URL rather than a local path."""
-    return path.startswith(_GIT_URL_SCHEMES) or _SCP_GIT_URL_RE.match(path) is not None
-
-
-def _format_results(header: str, results: list[SearchResult]) -> str:
-    """Render SearchResult objects as numbered, fenced code blocks."""
-    lines: list[str] = [header, ""]
-    for i, r in enumerate(results, 1):
-        lines.append(f"## {i}. {r.chunk.location} [score={r.score:.3f}]")
-        lines.append("```")
-        lines.append(r.chunk.content.strip())
-        lines.append("```")
-        lines.append("")
-    return "\n".join(lines)
-
-
-def main() -> None:
-    """Entry point for the semble command-line tool."""
-    import argparse
-
-    parser = argparse.ArgumentParser(
-        prog="semble",
-        description="Instant local code search for agents.",
-    )
-    parser.add_argument(
-        "path",
-        nargs="?",
-        default=None,
-        help="Local directory or git URL to pre-index at startup (optional).",
-    )
-    parser.add_argument("--ref", default=None, help="Branch or tag to check out (git URLs only).")
-    args = parser.parse_args()
-    asyncio.run(serve(args.path, ref=args.ref))
41 changes: 41 additions & 0 deletions src/semble/utils.py
@@ -0,0 +1,41 @@
from __future__ import annotations

import re

from semble.types import Chunk, SearchResult

_GIT_URL_SCHEMES = ("https://", "http://", "ssh://", "git://", "git+ssh://", "file://")
_SCP_GIT_URL_RE = re.compile(r"^[\w.-]+@[\w.-]+:(?!/)")


def _is_git_url(path: str) -> bool:
    """Return True if path looks like a remote git URL rather than a local path."""
    return path.startswith(_GIT_URL_SCHEMES) or _SCP_GIT_URL_RE.match(path) is not None


def _resolve_chunk(chunks: list[Chunk], file_path: str, line: int) -> Chunk | None:
    """Return the chunk containing *line* in *file_path*, or None.

    Reconstructs a Chunk from its JSON-primitive MCP tool arguments (file_path + line)
    before calling into the library.
    """
    fallback = None
    for chunk in chunks:
        if chunk.file_path == file_path and chunk.start_line <= line <= chunk.end_line:
            if line < chunk.end_line:
                return chunk
            if fallback is None:  # line == end_line: boundary; keep as fallback for end-of-file chunks
                fallback = chunk
    return fallback


def _format_results(header: str, results: list[SearchResult]) -> str:
    """Render SearchResult objects as numbered, fenced code blocks."""
    lines: list[str] = [header, ""]
    for i, r in enumerate(results, 1):
        lines.append(f"## {i}. {r.chunk.location} [score={r.score:.3f}]")
        lines.append("```")
        lines.append(r.chunk.content.strip())
        lines.append("```")
        lines.append("")
    return "\n".join(lines)
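
The boundary handling in `_resolve_chunk` is easiest to see with two adjacent chunks that share a line. The sketch below repeats the same loop against a stand-in `Chunk` dataclass; that dataclass, the `resolve_chunk` name, and the sample line ranges are assumptions for illustration and may not match `semble.types.Chunk` or real chunk boundaries.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Stand-in for semble.types.Chunk with only the fields the lookup needs (assumed)."""

    file_path: str
    start_line: int
    end_line: int


def resolve_chunk(chunks: list[Chunk], file_path: str, line: int) -> Chunk | None:
    """Same logic as _resolve_chunk above: prefer a chunk where `line` is not the last line."""
    fallback = None
    for chunk in chunks:
        if chunk.file_path == file_path and chunk.start_line <= line <= chunk.end_line:
            if line < chunk.end_line:
                return chunk
            if fallback is None:  # line == end_line: remember it, keep looking
                fallback = chunk
    return fallback


# Hypothetical index: two chunks of the same file that meet at line 10.
first = Chunk("src/auth.py", 1, 10)
second = Chunk("src/auth.py", 10, 20)
chunks = [first, second]

if __name__ == "__main__":
    print(resolve_chunk(chunks, "src/auth.py", 5))
    print(resolve_chunk(chunks, "src/auth.py", 10))
    print(resolve_chunk(chunks, "src/auth.py", 20))
    print(resolve_chunk(chunks, "src/auth.py", 21))
```

With these inputs, line 10 resolves to the second chunk because it starts there, while line 20 has no following chunk and falls back to the chunk it ends.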
2 changes: 1 addition & 1 deletion src/semble/version.py
@@ -1,2 +1,2 @@
-__version_triple__ = (0, 1, 0)
+__version_triple__ = (0, 1, 1)
__version__ = ".".join(map(str, __version_triple__))