diff --git a/README.md b/README.md index ba1659e..ce1ad94 100644 --- a/README.md +++ b/README.md @@ -108,6 +108,15 @@ Wire it into Claude Code (`~/.claude.json`): `--read-only` opens the DB with a shared lock and hides the `execute` tool. Full docs + the other six tools' references in [`docs/mcp.md`](docs/mcp.md). +### End-to-end example apps + +Beyond the per-language quickstarts in [`examples/`](examples/), the SQLR-38 umbrella tracks longer, opinionated example apps that exercise SQLRite in real-world shapes: + +| App | SDK | What it shows | +|---|---|---| +| [Python LLM agent with persistent memory](examples/python-agent/) | Python | Vector + lexical recall, fact extraction, summaries — all in one `.sqlrite` file | +| [Chat-with-your-notes via Claude Desktop MCP](examples/nodejs-notes/) | Node.js | Markdown → hybrid HNSW + BM25 index → `sqlrite-mcp --read-only` → Claude Desktop | + ### Developer guide In-depth documentation lives under [`docs/`](docs/). Start at [`docs/_index.md`](docs/_index.md) — it navigates to: diff --git a/examples/README.md b/examples/README.md index 040da24..c98ea78 100644 --- a/examples/README.md +++ b/examples/README.md @@ -23,6 +23,7 @@ Beyond the per-SDK quick-start tours above, the [SQLR-38 umbrella](../docs/roadm | App | Language / SDK | What it shows | Directory | |---|---|---|---| | LLM agent with persistent memory | Python | Vector + lexical recall, fact extraction, summaries — all in one `.sqlrite` file | [`python-agent/`](python-agent/) | +| Chat with your notes (MCP) | Node.js | Markdown → SQLRite hybrid retrieval, served to Claude Desktop via `sqlrite-mcp --read-only` | [`nodejs-notes/`](nodejs-notes/) | ## Running the Rust quickstart @@ -82,6 +83,16 @@ python -m sqlrite_agent # works offline; no API key required A full CLI chat agent whose long-term memory is one `.sqlrite` file. Embeds each turn, hybrid-searches over past messages and a structured `facts` table on every recall, and survives process restarts. 
Read [`python-agent/README.md`](python-agent/README.md) for the demo script and architecture diagram. +## Running the Node.js notes assistant (SQLR-40) + +```bash +cd examples/nodejs-notes +npm install +node bin/sqlrite-notes.mjs init ~/Documents/notes +``` + +Ingests a folder of markdown notes into a `notes.sqlrite` file with HNSW + BM25 indexes, then `sqlrite-notes serve` wraps `sqlrite-mcp --read-only` so **Claude Desktop / any MCP client** can `bm25_search` / `vector_search` / `query` / `ask` your local notes directly — no cloud sync, no third-party indexer. Default embedder is fully offline (deterministic hash bag-of-words); flip to `--embedder openai` with `OPENAI_API_KEY` set for real semantic recall. Read [`nodejs-notes/README.md`](nodejs-notes/README.md) for the Claude Desktop config snippet and the hybrid-retrieval SQL walkthrough. + ## Running the Node.js sample ```bash diff --git a/examples/nodejs-notes/.gitignore b/examples/nodejs-notes/.gitignore new file mode 100644 index 0000000..5577b65 --- /dev/null +++ b/examples/nodejs-notes/.gitignore @@ -0,0 +1,6 @@ +node_modules/ +*.sqlrite +*.sqlrite-journal +.env +.env.local +test-fixtures-tmp/ diff --git a/examples/nodejs-notes/README.md b/examples/nodejs-notes/README.md new file mode 100644 index 0000000..cd814a6 --- /dev/null +++ b/examples/nodejs-notes/README.md @@ -0,0 +1,344 @@ +# sqlrite-notes — chat with your markdown notes via Claude Desktop + +A Node.js CLI that ingests a folder of markdown notes (Obsidian +vault, Notion export, plain `~/Documents/notes`) into a SQLRite +database, then exposes the database to **Claude Desktop / any MCP +client** through the engine's first-party MCP server. + +End-user effect: drop your notes folder in, paste one block into +Claude Desktop's config, and ask Claude *"what did I write about +CRDTs last month?"* — it answers using your local notes. No cloud +sync, no third-party indexer, the entire memory is one `.sqlrite` +file on disk you can open in the REPL. 
+ +> **Why this example?** Other "chat with your notes" demos build a +> custom RAG pipeline and bolt it onto a model. This one shows that +> when the database itself speaks the agent protocol, you don't need +> a pipeline — *Claude drives the database directly* via +> `sqlrite-mcp`. The Node.js side is just the ingest + glue. + +## Architecture + +```mermaid +flowchart LR + Notes[/"~/notes/*.md<br/>(markdown)"/] -->|sqlrite-notes init / refresh| Ingest + Ingest["Ingest pipeline<br/>(walk → chunk → embed → store)"] --> DB[("notes.sqlrite<br/>documents · chunks<br/>HNSW + FTS indexes")] + DB -->|"sqlrite-mcp --read-only<br/>stdio JSON-RPC"| Claude["Claude Desktop<br/>(or any MCP client)"] + Claude -->|"vector_search · bm25_search · query · ask"| DB +``` + +The whole stack: Node.js for the **write side** (ingest pipeline, +chunking, embeddings), SQLRite for **storage + retrieval primitives** +(HNSW vector index, BM25 inverted index, raw SQL), and `sqlrite-mcp` +for the **read side** that Claude actually talks to. The Node CLI +never touches the database while Claude is connected — that's what +`--read-only` is for. + +## Schema (v1) + +| Table | Purpose | Indexes | +|-------------|------------------------------------------------------------------------------------------------|----------------------------------------------------------------------| +| `documents` | One row per `.md` file — path, title, mtime, full body, content hash. | UNIQUE on `path`. FTS on `content` (BM25 over whole docs). | +| `chunks` | One row per ~400-token slice of a document, plus a `VECTOR(384)` embedding. | HNSW on `embedding` (semantic KNN). FTS on `content` (passage BM25). | + +Hybrid retrieval queries `chunks` and fuses BM25 + vector cosine in a +single `ORDER BY` (see [`docs/fts.md`](../../docs/fts.md) for the +SQL pattern; the executor's `try_fts_probe` hook serves the top-k +straight from the inverted index). + +## Install + +The example lives inside the SQLRite monorepo for now (the umbrella +ticket SQLR-38 will lift it into its own repo once we've shipped a +few more). + +```bash +git clone https://github.com/joaoh82/rust_sqlite +cd rust_sqlite/examples/nodejs-notes +npm install +``` + +`npm install` pulls **`@joaoh82/sqlrite`** (pinned to `^0.10.0`) with +prebuilt napi-rs binaries for macOS-arm64, Linux x64/arm64, and +Windows x64 — no Rust toolchain required for the Node side. + +`sqlrite-mcp` is a separate Rust binary. 
Install it once, anywhere +on your `PATH`: + +```bash +# from crates.io (~30s): +cargo install sqlrite-mcp + +# or grab a prebuilt binary from GitHub Releases: +# https://github.com/joaoh82/rust_sqlite/releases +``` + +If you don't want to install globally, set `SQLRITE_MCP_BIN` to its +absolute path — `sqlrite-notes serve` will pick it up. + +## Run + +```bash +# 1. Ingest a folder of markdown into a notes.sqlrite database. +node bin/sqlrite-notes.mjs init ~/Documents/notes + +# 2. Confirm it works locally — same retrieval shape Claude will see. +node bin/sqlrite-notes.mjs search "what did I learn about CRDTs?" + +# 3. Wire up Claude Desktop using the snippet printed by `init` +# (also available any time via `sqlrite-notes config`). + +# 4. Open Claude Desktop. The sqlrite-mcp tools appear in the +# tool picker — `bm25_search`, `vector_search`, `query`, `ask`, +# plus `list_tables` / `describe_table` / `schema_dump`. +``` + +Once you've added the snippet to `claude_desktop_config.json` and +restarted Claude Desktop, run a chat like: + +> *"Summarize what I've written about Postgres over the last month."* + +Claude will call `bm25_search` (and/or `vector_search`) against the +`chunks` table, get back the matching passages, and answer with +inline citations to the file path and chunk number. + +## Zero-config: works fully offline + +The default embedder is a **deterministic hash-based bag-of-words** +embedder that runs in pure JavaScript. No API key, no network, +nothing to install — `sqlrite-notes init ~/Documents/notes` works +on a fresh laptop. + +Hybrid retrieval still beats either signal alone because BM25 is +already doing exact-term ranking; the hash embedder mostly carries +its weight via the long-tail of co-occurring tokens. + +For real semantic recall, switch to OpenAI: + +```bash +export OPENAI_API_KEY=sk-... 
+node bin/sqlrite-notes.mjs init ~/Documents/notes --embedder openai +``` + +Uses `text-embedding-3-small` with the `dimensions: 384` override +so it matches the schema. Override the model with +`--openai-model text-embedding-3-large` (and bump `--dim` if you +want full-fat dimensionality). + +## CLI surface + +``` +sqlrite-notes init Ingest into the notes DB (replaces the index). +sqlrite-notes refresh Re-ingest only files whose mtime/hash changed. +sqlrite-notes search "" Hybrid retrieval against the index (debug). +sqlrite-notes serve Spawn sqlrite-mcp --read-only against the DB. +sqlrite-notes stats Row counts. +sqlrite-notes config Print the Claude Desktop config snippet again. +``` + +Common flags: + +``` +--db Path to the SQLRite database file. Default: ~/.sqlrite-notes/notes.sqlrite +--embedder hash|openai Embedding provider. Default: hash (offline). +--dim Vector dimension. Default: 384. +--openai-model OpenAI embedding model. Default: text-embedding-3-small. +--chunk-tokens Target chunk size in tokens. Default: 400. +--chunk-overlap Chunk overlap in tokens. Default: 60. +``` + +## Claude Desktop config + +`sqlrite-notes init` prints this block; you can also regenerate it +any time with `sqlrite-notes config --db `: + +```json +{ + "mcpServers": { + "sqlrite-notes": { + "command": "sqlrite-notes", + "args": ["serve", "--db", "/Users/you/.sqlrite-notes/notes.sqlrite"] + } + } +} +``` + +Where it lives: + +| Platform | Path | +|---|---| +| macOS | `~/Library/Application Support/Claude/claude_desktop_config.json` | +| Linux | `${XDG_CONFIG_HOME:-~/.config}/Claude/claude_desktop_config.json` | +| Windows | `%APPDATA%\Claude\claude_desktop_config.json` | + +Merge the snippet into any existing `mcpServers` block; don't +overwrite the file wholesale. + +### Try it without Claude Desktop + +If you don't have Claude Desktop, the same MCP server works with +**any** MCP client — Cursor, Codex, your own. 
The fastest way to +verify the wiring is Anthropic's inspector: + +```bash +npx @modelcontextprotocol/inspector sqlrite-notes serve --db ./notes.sqlrite +``` + +Open the URL it prints, click through the tools, type JSON args. +Saves a lot of "did Claude restart correctly?" debugging. + +## Open the DB yourself + +The notes index is plain SQLRite — open it in the REPL whenever: + +```bash +$ cargo install sqlrite-engine # or grab a release binary +$ sqlrite ./notes.sqlrite +SQLRite v0.10.0 +sqlrite> SELECT path, title FROM documents ORDER BY mtime DESC LIMIT 5; +sqlrite> SELECT path, ord, substr(content, 1, 80) + ...> FROM chunks + ...> JOIN documents ON chunks.document_id = documents.id + ...> WHERE fts_match(chunks.content, 'rust embedded') + ...> ORDER BY bm25_score(chunks.content, 'rust embedded') DESC + ...> LIMIT 5; +``` + +This is the demo's whole point: **the notes index is just SQL**. You +can query it, back it up, copy it between machines, or feed the same +file into the Python / Go / WASM SDKs without converting anything. + +## How hybrid retrieval works + +The `search` command (and Claude, indirectly, when it composes +`query` calls) runs the canonical hybrid shape from +[`docs/fts.md`](../../docs/fts.md): + +```sql +SELECT id, document_id, ord, content FROM chunks + WHERE fts_match(content, 'collaborative editing CRDT') + ORDER BY 0.5 * bm25_score(content, 'collaborative editing CRDT') + + 0.5 * (1.0 - vec_distance_cosine(embedding, [/* query embedding */])) +DESC LIMIT 5; +``` + +Two things worth noting: + +1. `vec_distance_cosine` returns a *distance* (`1 - cos(a, b)`). + Hybrid scoring wants "higher is better", so we invert it. +2. `fts_match` pre-filters before scoring. Paraphrases with zero + shared tokens never get scored — a deliberate tradeoff. If a + query produces no FTS tokens at all (e.g. a single non-ASCII + word), `db.hybridSearch()` falls back to vector-only so Claude + always has *something* to ground on. 
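
To make the scoring concrete, the `ORDER BY` expression above can be mirrored in plain JavaScript. This is a minimal sketch under stated assumptions, not the code in `src/search.mjs` or the engine: `cosineDistance` and `hybridScore` are hypothetical helper names, and the BM25 values and two-dimensional embeddings are made up for illustration.

```javascript
// Illustrative only: hypothetical helpers that mirror the SQL scoring
// expression. They are not part of sqlrite-notes or the SQLRite SDK.

// Cosine distance, 1 - cos(a, b): the quantity the README says
// vec_distance_cosine returns (lower means more similar).
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumption: treat a zero vector as maximally dissimilar.
  if (normA === 0 || normB === 0) return 1;
  return 1 - dot / Math.sqrt(normA * normB);
}

// weight = 1 gives pure BM25, weight = 0 gives pure vector similarity.
// The distance is inverted so that "higher is better" for both terms.
function hybridScore({ bm25, distance, weight = 0.5 }) {
  return weight * bm25 + (1 - weight) * (1 - distance);
}

// Made-up candidates: 'a' is a weak lexical match whose embedding points
// the same way as the query; 'b' is a strong lexical match whose
// embedding is orthogonal to it.
const queryEmbedding = [1, 0];
const candidates = [
  { id: 'a', bm25: 0.2, embedding: [1, 0] },
  { id: 'b', bm25: 0.9, embedding: [0, 1] },
];

const ranked = candidates
  .map((c) => ({
    id: c.id,
    score: hybridScore({
      bm25: c.bm25,
      distance: cosineDistance(queryEmbedding, c.embedding),
    }),
  }))
  .sort((x, y) => y.score - x.score); // descending, like ORDER BY ... DESC

console.log(ranked);
```

With the default weight of 0.5, candidate `a` outranks `b` because its vector similarity outweighs `b`'s lexical advantage; at weight 1 the order would follow BM25 alone. Real BM25 scores are not bounded to the 0..1 range, so the balance point may need tuning per corpus.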
+ +The `-w 0..1` flag on `sqlrite-notes search` tunes the BM25 / vector +balance. At `-w 1` you get pure BM25; at `-w 0` pure vector cosine. +Default is 0.5. + +## Known limitations + +- **Concurrent `refresh` while `serve` is running.** SQLRite shipped + `BEGIN CONCURRENT` writes in v0.10 (the SQLR-22 / Phase 11 work), + so reading from `sqlrite-mcp --read-only` while `sqlrite-notes + refresh` writes is supported in principle. *In practice we + recommend stopping Claude Desktop's MCP connection during a + refresh* — Claude Desktop caches the tool schemas at server-spawn + time and won't notice newly-ingested files until you reload it + anyway, so the gain from coexisting isn't worth the surprise. + +- **HNSW after delete + re-insert (engine bug, SQLR-8).** The + engine's HNSW chunk index panics when rows are deleted and re- + inserted within the same connection lifetime — `index out of + bounds` inside `DistanceMetric::compute`. The ingest pipeline + works around it by splitting refresh into a delete-phase → + close/reopen → insert-phase, which forces a clean index rebuild + on next open. We pay the close/reopen hop only when there are + actual deletions (so first-time `init` skips it). Once SQLR-8 + lands in the engine, `src/ingest.mjs` can drop the `db.reopen()` + call. + +- **Aggregates limited.** The engine supports `COUNT(*)`, + `SUM`/`AVG`/`MIN`/`MAX` but not arbitrary expressions in `SELECT` + projection beyond aggregates (see [`docs/supported-sql.md`](../../docs/supported-sql.md)). + The `stats` command sticks to `COUNT(*)`. + +- **No parameter binding yet.** Engine SDKs don't yet accept `?` + placeholders ([Phase 5a.2](../../docs/roadmap.md) follow-up); all + SQL strings inline literals via the `q()` helper in + [`src/sqlutil.mjs`](src/sqlutil.mjs). That helper handles + quoting + vector-literal encoding for the four value types we use + (string / int / vector / null). 
+ +- **Re-ingest persistence.** `refresh` requires you to pass the + source directory again (we don't yet store it inside the DB). If + you forget the path, run `sqlrite-notes stats --db ` to + confirm the index exists, then re-run `init` to rebuild from + scratch — it's idempotent. + +- **`--watch` mode** — not in v1. Re-run `refresh` manually after + adding notes. + +- **Authentication.** None. This is a single-user local tool; the + MCP server inherits the spawner's filesystem privileges. Don't + point it at a database you wouldn't share with whoever can read + your home directory. + +## Development + +```bash +npm install +npm test # offline; runs all 40 unit + integration tests +``` + +The test suite uses `node:test` and exercises: + +- `chunker` (frontmatter, title derivation, overlap windowing) +- `sqlutil` (every value type the engine accepts as a SQL literal) +- `embeddings` (hash determinism + a mocked OpenAI HTTP call) +- `db` (schema migration, upsert/delete/cascade, hybrid retrieval) +- `ingest` (end-to-end against the markdown fixtures in + [`test/fixtures/`](test/fixtures/)) +- `serve` + `claude-config` (binary lookup, snippet shape) + +A few tests need the `@joaoh82/sqlrite` engine binding installed; +they skip cleanly (with a message) if it isn't. 
+ +## Layout + +``` +examples/nodejs-notes/ +├── package.json # @joaoh82/sqlrite pinned to ^0.10.0 +├── README.md # this file +├── bin/ +│ └── sqlrite-notes.mjs # entry — calls src/cli.mjs +├── src/ +│ ├── cli.mjs # argv parsing + command dispatch +│ ├── config.mjs # defaults + path resolution +│ ├── sqlutil.mjs # q() + ident() — SQL literal helpers +│ ├── db.mjs # schema, migrations, every SQL string +│ ├── chunker.mjs # frontmatter + heading-aware chunking +│ ├── embeddings.mjs # hash + OpenAI embedders +│ ├── ingest.mjs # plan → delete → reopen → insert +│ ├── search.mjs # hybrid retrieval driver +│ ├── serve.mjs # spawn sqlrite-mcp --read-only +│ └── claude-config.mjs # Claude Desktop snippet renderer +└── test/ # node --test suite + ├── fixtures/ # 3 short markdown notes + ├── chunker.test.mjs + ├── claude-config.test.mjs + ├── db.test.mjs # integration — needs @joaoh82/sqlrite + ├── embeddings.test.mjs + ├── ingest.test.mjs # integration — needs @joaoh82/sqlrite + ├── serve.test.mjs + └── sqlutil.test.mjs +``` + +The example binds only to the documented public surfaces: +[`@joaoh82/sqlrite`](../../sdk/nodejs/) (`Database`, `Statement`) +and [`sqlrite-mcp`](../../docs/mcp.md) as a child process. It does +not reach into engine internals. + +## License + +MIT — same as the rest of the rust_sqlite repo. diff --git a/examples/nodejs-notes/bin/sqlrite-notes.mjs b/examples/nodejs-notes/bin/sqlrite-notes.mjs new file mode 100644 index 0000000..b3c7e5d --- /dev/null +++ b/examples/nodejs-notes/bin/sqlrite-notes.mjs @@ -0,0 +1,15 @@ +#!/usr/bin/env node +// Entry point — keep this thin so installations can `chmod +x` +// the file and dispatch into the real CLI module. +import { run } from '../src/cli.mjs'; + +run(process.argv.slice(2)).then( + (code) => process.exit(code ?? 0), + (err) => { + process.stderr.write(`sqlrite-notes: ${err.message ?? err}\n`); + if (process.env.SQLRITE_NOTES_DEBUG) { + process.stderr.write(`${err.stack ?? 
''}\n`); + } + process.exit(1); + }, +); diff --git a/examples/nodejs-notes/package-lock.json b/examples/nodejs-notes/package-lock.json new file mode 100644 index 0000000..02fab66 --- /dev/null +++ b/examples/nodejs-notes/package-lock.json @@ -0,0 +1,31 @@ +{ + "name": "sqlrite-notes", + "version": "0.1.0", + "lockfileVersion": 3, + "requires": true, + "packages": { + "": { + "name": "sqlrite-notes", + "version": "0.1.0", + "license": "MIT", + "dependencies": { + "@joaoh82/sqlrite": "^0.10.0" + }, + "bin": { + "sqlrite-notes": "bin/sqlrite-notes.mjs" + }, + "engines": { + "node": ">=20" + } + }, + "node_modules/@joaoh82/sqlrite": { + "version": "0.10.0", + "resolved": "https://registry.npmjs.org/@joaoh82/sqlrite/-/sqlrite-0.10.0.tgz", + "integrity": "sha512-zHYElNw7gFVO5fke5SvFtz8y+rVAVEFz0DZuNGM0QtWTEVwCk3/MdQHV03zo3yPK/y6zxPXyKY2yvGFpN/1uVw==", + "license": "MIT", + "engines": { + "node": ">= 18" + } + } + } +} diff --git a/examples/nodejs-notes/package.json b/examples/nodejs-notes/package.json new file mode 100644 index 0000000..4cd463d --- /dev/null +++ b/examples/nodejs-notes/package.json @@ -0,0 +1,38 @@ +{ + "name": "sqlrite-notes", + "version": "0.1.0", + "private": true, + "description": "Example: a Node.js CLI that ingests a folder of markdown notes into SQLRite (hybrid HNSW + BM25 retrieval), then exposes the database to Claude Desktop / any MCP client via sqlrite-mcp.", + "type": "module", + "engines": { + "node": ">=20" + }, + "bin": { + "sqlrite-notes": "bin/sqlrite-notes.mjs" + }, + "main": "src/cli.mjs", + "scripts": { + "test": "node --test 'test/**/*.test.mjs'" + }, + "dependencies": { + "@joaoh82/sqlrite": "^0.10.0" + }, + "keywords": [ + "sqlrite", + "mcp", + "claude", + "notes", + "obsidian", + "rag", + "vector-search", + "bm25" + ], + "author": "SQLRite contributors", + "license": "MIT", + "repository": { + "type": "git", + "url": "https://github.com/joaoh82/rust_sqlite", + "directory": "examples/nodejs-notes" + }, + "homepage": 
"https://sqlritedb.com" +} diff --git a/examples/nodejs-notes/src/chunker.mjs b/examples/nodejs-notes/src/chunker.mjs new file mode 100644 index 0000000..804cfac --- /dev/null +++ b/examples/nodejs-notes/src/chunker.mjs @@ -0,0 +1,191 @@ +// Markdown chunker. +// +// Split a document into ~`targetTokens`-sized chunks with optional +// overlap. The chunker keeps three rules: +// +// 1. Never split mid-paragraph — paragraph boundaries (blank lines) +// are the smallest atomic unit. +// 2. Carry the closest preceding heading into each chunk as a +// one-line prefix. Without it, mid-document chunks lose every +// hint of structure and embeddings + BM25 both degrade. +// 3. Token counting is approximate — `text.split(/\s+/).length` +// is close enough for chunking, and dodges a heavy tokenizer +// dependency. The retrieval side never needs exact counts. + +const DEFAULT_TARGET = 400; +const DEFAULT_OVERLAP = 60; + +/** + * Strip YAML frontmatter from a markdown document. Returns + * `{ frontmatter, body }`. Frontmatter is returned as the raw text + * between the fences (no YAML parsing — we only consult it for the + * title). + * + * @param {string} text + * @returns {{ frontmatter: string, body: string }} + */ +export function stripFrontmatter(text) { + if (!text.startsWith('---\n') && !text.startsWith('---\r\n')) { + return { frontmatter: '', body: text }; + } + const re = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/; + const m = text.match(re); + if (!m) return { frontmatter: '', body: text }; + return { frontmatter: m[1], body: text.slice(m[0].length) }; +} + +/** + * Derive a title for the document. + * + * 1. `title:` field in YAML frontmatter, if present. + * 2. First `#` / `##` / `###` heading in the body. + * 3. The filename stem, supplied by the caller. 
+ * + * @param {{ frontmatter: string, body: string, fallback: string }} args + * @returns {string} + */ +export function deriveTitle({ frontmatter, body, fallback }) { + if (frontmatter) { + const m = frontmatter.match(/^title\s*:\s*(?:["']?)(.+?)(?:["']?)\s*$/m); + if (m) return m[1].trim(); + } + const heading = body.match(/^#{1,6}\s+(.+?)\s*$/m); + if (heading) return heading[1].trim(); + return fallback; +} + +/** + * Approximate token count — `whitespace-separated word count` is a + * fine proxy at the granularities we care about (~hundreds). + * + * @param {string} text + */ +export function approxTokens(text) { + if (!text) return 0; + return text.trim().split(/\s+/).filter(Boolean).length; +} + +/** + * Chunk markdown body text into ~targetTokens-sized passages. + * + * @param {string} body + * @param {{ targetTokens?: number, overlapTokens?: number }} [opts] + * @returns {Array<{ ord: number, content: string }>} + */ +export function chunkMarkdown(body, opts = {}) { + const target = opts.targetTokens ?? DEFAULT_TARGET; + const overlap = opts.overlapTokens ?? DEFAULT_OVERLAP; + + // Step 1: walk paragraphs, attaching the most-recent heading to + // each one. Keep the paragraph text verbatim so embeddings see the + // original prose. + const blocks = paragraphsWithHeadings(body); + + // Step 2: greedy pack paragraphs into chunks until we cross the + // token target. Paragraphs that exceed the target on their own get + // their own chunk — never split mid-paragraph. + const chunks = []; + let current = []; // { heading, text }[] + let currentTokens = 0; + let lastHeading = ''; + + function flush() { + if (current.length === 0) return; + chunks.push(renderChunk(current)); + current = []; + currentTokens = 0; + } + + for (const block of blocks) { + const blockTokens = approxTokens(block.text); + // Keep the heading context — if it changed, we'll surface the new + // one in this chunk, but otherwise we don't repeat it. 
+ if (currentTokens > 0 && currentTokens + blockTokens > target) { + flush(); + } + if (current.length === 0 && block.heading) { + lastHeading = block.heading; + current.push({ heading: block.heading, text: '' }); + } else if (block.heading && block.heading !== lastHeading) { + lastHeading = block.heading; + current.push({ heading: block.heading, text: '' }); + } + current.push({ heading: '', text: block.text }); + currentTokens += blockTokens; + } + flush(); + + // Step 3: apply overlap by prepending the tail of the previous + // chunk to the current one. Overlap reduces the chance that a + // matching sentence sits exactly on a chunk boundary. + if (overlap > 0 && chunks.length > 1) { + for (let i = chunks.length - 1; i > 0; i--) { + const prev = chunks[i - 1]; + const tail = trailingTokens(prev, overlap); + if (tail) { + chunks[i] = `${tail}\n\n${chunks[i]}`; + } + } + } + + return chunks + .map((content, ord) => ({ ord, content: content.trim() })) + .filter((c) => c.content.length > 0); +} + +// ------------------------------------------------------------------ +// Helpers + +function paragraphsWithHeadings(body) { + const lines = body.split(/\r?\n/); + /** @type {Array<{ heading: string, text: string }>} */ + const blocks = []; + let currentHeading = ''; + let buffer = []; + + function flushBuffer() { + const text = buffer.join('\n').trim(); + if (text) blocks.push({ heading: currentHeading, text }); + buffer = []; + } + + for (const line of lines) { + const headingMatch = line.match(/^(#{1,6})\s+(.+?)\s*$/); + if (headingMatch) { + flushBuffer(); + currentHeading = headingMatch[2].trim(); + // Also emit the heading itself as a tiny block so it can be the + // sole content of a chunk if nothing follows. 
+ blocks.push({ heading: currentHeading, text: line.trim() }); + continue; + } + if (line.trim() === '') { + flushBuffer(); + } else { + buffer.push(line); + } + } + flushBuffer(); + return blocks; +} + +function renderChunk(parts) { + // Concat the parts (mix of heading-only blocks and paragraph + // blocks) into a single text body. Headings are kept inline so the + // embedded text reads naturally. + const out = []; + for (const p of parts) { + if (p.heading) { + out.push(`# ${p.heading}`); + } else if (p.text) { + out.push(p.text); + } + } + return out.join('\n\n'); +} + +function trailingTokens(text, n) { + const tokens = text.split(/\s+/).filter(Boolean); + if (tokens.length <= n) return text; + return tokens.slice(tokens.length - n).join(' '); +} diff --git a/examples/nodejs-notes/src/claude-config.mjs b/examples/nodejs-notes/src/claude-config.mjs new file mode 100644 index 0000000..396d0b7 --- /dev/null +++ b/examples/nodejs-notes/src/claude-config.mjs @@ -0,0 +1,65 @@ +// Generate a copy-paste-ready `claude_desktop_config.json` snippet so +// `sqlrite-notes init` can finish with "now paste this into Claude +// Desktop". We never WRITE the file (too easy to clobber the user's +// other MCP servers) — printing is honest and obvious. + +import { claudeDesktopConfigPath } from './config.mjs'; + +/** + * Build the per-server config block. Uses `sqlrite-notes serve --db + * ` so the wiring stays the same regardless of where the user + * installed `sqlrite-mcp`. + * + * @param {{ dbPath: string, binPath?: string }} args + */ +export function buildConfig({ dbPath, binPath }) { + const command = binPath ?? 'sqlrite-notes'; + return { + mcpServers: { + 'sqlrite-notes': { + command, + args: ['serve', '--db', dbPath], + }, + }, + }; +} + +/** + * Render the JSON block and the surrounding "wire-up" instructions. 
+ * + * @param {{ dbPath: string, binPath?: string }} args + * @returns {string} + */ +export function renderInstructions({ dbPath, binPath }) { + const cfg = buildConfig({ dbPath, binPath }); + const json = JSON.stringify(cfg, null, 2); + const cfgPath = claudeDesktopConfigPath(); + return [ + '─── Wire up Claude Desktop ───────────────────────────────────', + '', + `1. Open or create:`, + ` ${cfgPath}`, + '', + '2. Merge this block into the file (preserving any other', + ` "mcpServers" you already have):`, + '', + indent(json), + '', + '3. Restart Claude Desktop. The "sqlrite-notes" tools should', + ' appear in the tool picker on the next chat.', + '', + 'Tip: use Anthropic\'s MCP Inspector to dry-run the server before', + 'pointing Claude Desktop at it:', + '', + ` npx @modelcontextprotocol/inspector sqlrite-notes serve --db ${JSON.stringify(dbPath)}`, + '', + '──────────────────────────────────────────────────────────────', + ].join('\n'); +} + +function indent(text) { + return text + .split(/\r?\n/) + .map((l) => ` ${l}`) + .join('\n'); +} diff --git a/examples/nodejs-notes/src/cli.mjs b/examples/nodejs-notes/src/cli.mjs new file mode 100644 index 0000000..69731b6 --- /dev/null +++ b/examples/nodejs-notes/src/cli.mjs @@ -0,0 +1,350 @@ +// Top-level CLI dispatcher. Subcommands: +// +// init — create / replace the notes DB by ingesting . +// refresh — incremental re-ingest based on mtime + hash. +// search "" — debug retrieval the way Claude would over MCP. +// serve — spawn sqlrite-mcp --read-only against the DB. +// stats — quick row counts. +// config — print the Claude Desktop wiring snippet. +// +// Flag parsing uses node:util's parseArgs so we have no external +// dep just for argv handling. Each subcommand owns its own option +// schema. Unknown / missing args print usage. 
+ +import { parseArgs } from 'node:util'; + +import { NotesDB } from './db.mjs'; +import { ingest, refresh } from './ingest.mjs'; +import { makeEmbedder } from './embeddings.mjs'; +import { search, renderResults } from './search.mjs'; +import { spawnMcpServer } from './serve.mjs'; +import { renderInstructions } from './claude-config.mjs'; +import { + resolveDbPath, + resolveDir, + defaultDbPath, + DEFAULT_EMBEDDING_DIM, + DEFAULT_CHUNK_TOKENS, + DEFAULT_CHUNK_OVERLAP, +} from './config.mjs'; + +const VERSION = '0.1.0'; + +const USAGE = `sqlrite-notes ${VERSION} — chat with your markdown notes via Claude Desktop + SQLRite MCP. + +Usage: + sqlrite-notes [options] + +Commands: + init Build (or rebuild) the notes index from . + refresh Incremental re-ingest based on file mtime/hash. + search "" Run hybrid retrieval against the index (debug). + serve Spawn sqlrite-mcp --read-only against the DB. + stats Print row counts. + config Print the Claude Desktop config snippet. + help Show this message. + +Common options: + --db Path to the SQLRite database file. + Default: ${defaultDbPath()} + --embedder hash|openai Embedding provider (default: hash, offline). + --dim Vector dimension (default: ${DEFAULT_EMBEDDING_DIM}). + --openai-model OpenAI embedding model (default: text-embedding-3-small). + +Init / refresh options: + --chunk-tokens Target chunk size in tokens (default: ${DEFAULT_CHUNK_TOKENS}). + --chunk-overlap Chunk overlap in tokens (default: ${DEFAULT_CHUNK_OVERLAP}). + +Search options: + -k Number of results to return (default: 5). + -w <0..1> BM25 vs vector weight (default: 0.5). + +Environment: + OPENAI_API_KEY Required when --embedder openai. + SQLRITE_NOTES_EMBEDDER Default embedder (hash | openai). + SQLRITE_NOTES_OPENAI_MODEL Override OpenAI model id. + SQLRITE_MCP_BIN Explicit path to sqlrite-mcp for 'serve'. +`; + +/** + * Entry point. Returns the process exit code (0 = OK). 
+ * + * @param {string[]} argv arguments after `node bin/sqlrite-notes.mjs` + */ +export async function run(argv) { + const [command, ...rest] = argv; + if (!command || command === 'help' || command === '--help' || command === '-h') { + process.stdout.write(USAGE); + return 0; + } + if (command === 'version' || command === '--version' || command === '-V') { + process.stdout.write(`sqlrite-notes ${VERSION}\n`); + return 0; + } + + switch (command) { + case 'init': + return cmdInit(rest); + case 'refresh': + return cmdRefresh(rest); + case 'search': + return cmdSearch(rest); + case 'serve': + return cmdServe(rest); + case 'stats': + return cmdStats(rest); + case 'config': + return cmdConfig(rest); + default: + process.stderr.write(`unknown command: ${command}\n\n`); + process.stderr.write(USAGE); + return 2; + } +} + +// ------------------------------------------------------------------ +// init + +async function cmdInit(argv) { + const { values, positionals } = parseArgs({ + args: argv, + allowPositionals: true, + options: { + db: { type: 'string' }, + embedder: { type: 'string' }, + dim: { type: 'string' }, + 'openai-model': { type: 'string' }, + 'chunk-tokens': { type: 'string' }, + 'chunk-overlap': { type: 'string' }, + }, + }); + if (positionals.length === 0) { + process.stderr.write('init: missing \n\nusage: sqlrite-notes init [--db path] [--embedder hash|openai]\n'); + return 2; + } + const root = resolveDir(positionals[0]); + const dbPath = resolveDbPath(values.db); + const dim = parseDim(values.dim); + const embedder = makeEmbedder({ + kind: values.embedder, + dim, + model: values['openai-model'], + }); + const db = new NotesDB(dbPath, { dim: embedder.dim }); + + try { + process.stdout.write(`sqlrite-notes ${VERSION}\n`); + process.stdout.write(` db: ${dbPath}\n`); + process.stdout.write(` source: ${root}\n`); + process.stdout.write(` embedder: ${embedder.name} (dim=${embedder.dim})\n`); + + const stats = await ingest({ + db, + root, + embedder, + logger: 
(s) => process.stdout.write(`${s}\n`), + chunkOpts: parseChunkOpts(values), + }); + process.stdout.write(`\ningested ${stats.files} file(s), ${stats.chunks} chunk(s) in ${stats.elapsedMs} ms\n`); + process.stdout.write('\n'); + process.stdout.write(renderInstructions({ dbPath })); + process.stdout.write('\n'); + return 0; + } finally { + db.close(); + } +} + +// ------------------------------------------------------------------ +// refresh + +async function cmdRefresh(argv) { + const { values, positionals } = parseArgs({ + args: argv, + allowPositionals: true, + options: { + db: { type: 'string' }, + embedder: { type: 'string' }, + dim: { type: 'string' }, + 'openai-model': { type: 'string' }, + 'chunk-tokens': { type: 'string' }, + 'chunk-overlap': { type: 'string' }, + source: { type: 'string' }, + }, + }); + // is optional for refresh — if omitted, we re-ingest the same + // tree that init recorded. (For now we just require it; we don't + // store the source dir in the DB. Documented in the README.) + const rootArg = values.source ?? 
positionals[0]; + if (!rootArg) { + process.stderr.write( + 'refresh: pass the source directory as a positional (or --source ).\n' + + 'We don\'t yet persist the source path inside the DB — see the README\n' + + '"Known simplifications" section.\n', + ); + return 2; + } + const root = resolveDir(rootArg); + const dbPath = resolveDbPath(values.db); + const dim = parseDim(values.dim); + const embedder = makeEmbedder({ + kind: values.embedder, + dim, + model: values['openai-model'], + }); + const db = new NotesDB(dbPath, { dim: embedder.dim }); + try { + const stats = await refresh({ + db, + root, + embedder, + logger: (s) => process.stdout.write(`${s}\n`), + chunkOpts: parseChunkOpts(values), + }); + process.stdout.write( + `refreshed: ${stats.files} updated, ${stats.skipped} unchanged, ${stats.deleted} deleted (${stats.elapsedMs} ms)\n`, + ); + return 0; + } finally { + db.close(); + } +} + +// ------------------------------------------------------------------ +// search + +async function cmdSearch(argv) { + const { values, positionals } = parseArgs({ + args: argv, + allowPositionals: true, + options: { + db: { type: 'string' }, + embedder: { type: 'string' }, + dim: { type: 'string' }, + 'openai-model': { type: 'string' }, + k: { type: 'string', short: 'k' }, + w: { type: 'string', short: 'w' }, + }, + }); + const query = positionals.join(' ').trim(); + if (!query) { + process.stderr.write('search: missing query string.\n\nusage: sqlrite-notes search "" [-k N] [-w 0..1]\n'); + return 2; + } + const dbPath = resolveDbPath(values.db); + const dim = parseDim(values.dim); + const embedder = makeEmbedder({ + kind: values.embedder, + dim, + model: values['openai-model'], + }); + const db = new NotesDB(dbPath, { dim: embedder.dim, readOnly: true }); + try { + const hits = await search({ + db, + embedder, + query, + k: parseInt2(values.k, 5), + weight: parseFloat2(values.w, 0.5), + }); + process.stdout.write(renderResults(query, hits)); + return 0; + } finally { + 
db.close(); + } +} + +// ------------------------------------------------------------------ +// serve + +async function cmdServe(argv) { + const { values } = parseArgs({ + args: argv, + options: { + db: { type: 'string' }, + }, + }); + const dbPath = resolveDbPath(values.db); + // sqlrite-mcp opens its own database, so we don't touch it here — + // just pass the resolved path through. + const code = await spawnMcpServer({ dbPath }); + return code; +} + +// ------------------------------------------------------------------ +// stats + +async function cmdStats(argv) { + const { values } = parseArgs({ + args: argv, + options: { + db: { type: 'string' }, + }, + }); + const dbPath = resolveDbPath(values.db); + const db = new NotesDB(dbPath, { readOnly: true }); + try { + const s = db.stats(); + process.stdout.write(`db: ${dbPath}\n`); + process.stdout.write(`documents: ${s.documents}\n`); + process.stdout.write(`chunks: ${s.chunks}\n`); + process.stdout.write(`embedding dim: ${s.dim}\n`); + return 0; + } finally { + db.close(); + } +} + +// ------------------------------------------------------------------ +// config + +async function cmdConfig(argv) { + const { values } = parseArgs({ + args: argv, + options: { + db: { type: 'string' }, + bin: { type: 'string' }, + }, + }); + const dbPath = resolveDbPath(values.db); + process.stdout.write(renderInstructions({ dbPath, binPath: values.bin })); + process.stdout.write('\n'); + return 0; +} + +// ------------------------------------------------------------------ +// shared option parsing + +function parseDim(raw) { + if (raw === undefined) return DEFAULT_EMBEDDING_DIM; + const n = parseInt(raw, 10); + if (!Number.isFinite(n) || n <= 0) { + throw new Error(`--dim: invalid value ${JSON.stringify(raw)}`); + } + return n; +} + +function parseChunkOpts(values) { + return { + targetTokens: parseInt2(values['chunk-tokens'], DEFAULT_CHUNK_TOKENS), + overlapTokens: parseInt2(values['chunk-overlap'], DEFAULT_CHUNK_OVERLAP), + }; +} + 
+function parseInt2(raw, fallback) { + if (raw === undefined) return fallback; + const n = parseInt(raw, 10); + if (!Number.isFinite(n) || n < 0) { + throw new Error(`invalid integer: ${JSON.stringify(raw)}`); + } + return n; +} + +function parseFloat2(raw, fallback) { + if (raw === undefined) return fallback; + const n = parseFloat(raw); + if (!Number.isFinite(n)) { + throw new Error(`invalid number: ${JSON.stringify(raw)}`); + } + return n; +} diff --git a/examples/nodejs-notes/src/config.mjs b/examples/nodejs-notes/src/config.mjs new file mode 100644 index 0000000..1984be1 --- /dev/null +++ b/examples/nodejs-notes/src/config.mjs @@ -0,0 +1,71 @@ +// Defaults + small helpers around config paths and the database +// location. Everything is overridable via flags on the CLI; this +// module just picks reasonable fallbacks. + +import { homedir, platform } from 'node:os'; +import { resolve, join } from 'node:path'; + +export const DEFAULT_EMBEDDING_DIM = 384; +export const DEFAULT_CHUNK_TOKENS = 400; +export const DEFAULT_CHUNK_OVERLAP = 60; + +/** Resolve the default DB path: ~/.sqlrite-notes/notes.sqlrite */ +export function defaultDbPath() { + return join(homedir(), '.sqlrite-notes', 'notes.sqlrite'); +} + +/** + * Resolve a user-supplied directory path. Expands `~` and resolves + * relative paths against the current working directory. + * + * @param {string} input + * @returns {string} + */ +export function resolveDir(input) { + if (!input) throw new Error('resolveDir(): empty path'); + let expanded = input; + if (expanded === '~' || expanded.startsWith('~/')) { + expanded = join(homedir(), expanded.slice(1)); + } + return resolve(expanded); +} + +/** + * Resolve a user-supplied DB path. Same expansion rules as + * `resolveDir` but doesn't require the parent directory to exist — + * the caller (db.mjs) will mkdir as needed. + * + * @param {string | undefined} input + * @returns {string} + */ +export function resolveDbPath(input) { + return resolveDir(input ?? 
defaultDbPath()); +} + +/** + * Best-guess location of Claude Desktop's config file. + * Used only for `init`'s "wire me up" hint — we never read or + * write the file from the CLI. + * + * @returns {string} + */ +export function claudeDesktopConfigPath() { + if (platform() === 'darwin') { + return join( + homedir(), + 'Library', + 'Application Support', + 'Claude', + 'claude_desktop_config.json', + ); + } + if (platform() === 'win32') { + const appData = + process.env.APPDATA ?? join(homedir(), 'AppData', 'Roaming'); + return join(appData, 'Claude', 'claude_desktop_config.json'); + } + // Linux — Claude Desktop's Linux build is in beta; this is the + // documented path. Falls back to ~/.config if XDG_CONFIG_HOME unset. + const xdg = process.env.XDG_CONFIG_HOME ?? join(homedir(), '.config'); + return join(xdg, 'Claude', 'claude_desktop_config.json'); +} diff --git a/examples/nodejs-notes/src/db.mjs b/examples/nodejs-notes/src/db.mjs new file mode 100644 index 0000000..f57ce5e --- /dev/null +++ b/examples/nodejs-notes/src/db.mjs @@ -0,0 +1,400 @@ +// SQLRite-backed storage for the notes index. +// +// Owns the schema, migrations, and every SQL string in the project. +// Higher-level modules (ingest.mjs, search.mjs) call into `NotesDB` +// rather than touching SQL directly. +// +// Schema v1 — two tables: +// +// documents(id, path, title, mtime, content, content_hash) +// FTS index on `content`. +// +// chunks(id, document_id, ord, content, embedding VECTOR(dim)) +// HNSW index on `embedding`, FTS index on `content`. +// +// One row per file in `documents`; one row per ~400-token slice in +// `chunks`. Hybrid retrieval queries `chunks` (vector + BM25, fused +// at the SQL level) and joins back to `documents` for path / title.
+ +import { mkdirSync } from 'node:fs'; +import { dirname } from 'node:path'; + +import { Database } from '@joaoh82/sqlrite'; + +import { q, ident } from './sqlutil.mjs'; +import { DEFAULT_EMBEDDING_DIM } from './config.mjs'; + +const SCHEMA_VERSION = 1; + +export class NotesDB { + /** + * Open or create a notes database at `path`. Pass `:memory:` for a + * transient store (useful in tests). + * + * @param {string} path + * @param {{ dim?: number, readOnly?: boolean }} [opts] + */ + constructor(path, opts = {}) { + this.path = path; + this.dim = opts.dim ?? DEFAULT_EMBEDDING_DIM; + + if (path !== ':memory:') { + mkdirSync(dirname(path), { recursive: true }); + } + + this._db = opts.readOnly + ? Database.openReadOnly(path) + : new Database(path); + + if (!opts.readOnly) { + this._migrate(); + } + } + + // ------------------------------------------------------------------ + // Migrations + + _migrate() { + const cur = this._db; + let current = 0; + try { + const row = cur.prepare('SELECT version FROM schema_version').get(); + current = row?.version ?? 0; + } catch { + // schema_version table doesn't exist yet — fresh database. 
+ cur.exec( + 'CREATE TABLE schema_version (version INTEGER PRIMARY KEY)', + ); + cur.exec( + `INSERT INTO schema_version (version) VALUES (${q(0)})`, + ); + } + + if (current < 1) { + this._applyV1(); + cur.exec(`DELETE FROM schema_version`); + cur.exec( + `INSERT INTO schema_version (version) VALUES (${q(SCHEMA_VERSION)})`, + ); + } + } + + _applyV1() { + const dim = this.dim; + this._db.exec(` + CREATE TABLE documents ( + id INTEGER PRIMARY KEY, + path TEXT NOT NULL UNIQUE, + title TEXT NOT NULL, + mtime INTEGER NOT NULL, + content TEXT NOT NULL, + content_hash TEXT NOT NULL + ) + `); + this._db.exec(` + CREATE TABLE chunks ( + id INTEGER PRIMARY KEY, + document_id INTEGER NOT NULL, + ord INTEGER NOT NULL, + content TEXT NOT NULL, + embedding VECTOR(${dim}) + ) + `); + // FTS indexes give us BM25 ranking via `bm25_score(col, 'q')` — + // both documents.content (whole-document hits) and chunks.content + // (passage-level hits) are useful surfaces. + this._db.exec('CREATE INDEX idx_documents_fts ON documents USING fts (content)'); + this._db.exec('CREATE INDEX idx_chunks_fts ON chunks USING fts (content)'); + // HNSW for semantic KNN over chunk embeddings. + this._db.exec('CREATE INDEX idx_chunks_emb ON chunks USING hnsw (embedding)'); + } + + // ------------------------------------------------------------------ + // Writes + + /** + * Upsert a document by `path`. Returns `{ id, replaced }` — `replaced` + * is true if a previous version of the document was removed first. + * + * Existing chunks of a replaced document are deleted here; the caller + * inserts fresh ones via `insertChunk(...)` after re-chunking + re-embedding.
+ * + * @param {{ path: string, title: string, mtime: number, content: string, contentHash: string }} doc + * @returns {{ id: number, replaced: boolean }} + */ + upsertDocument(doc) { + const existing = this._db + .prepare(`SELECT id FROM documents WHERE path = ${q(doc.path)}`) + .get(); + let replaced = false; + + if (existing) { + replaced = true; + // Delete existing chunks first — referential consistency. + this._db.exec(`DELETE FROM chunks WHERE document_id = ${q(existing.id)}`); + this._db.exec(`DELETE FROM documents WHERE id = ${q(existing.id)}`); + } + + this._db.exec( + `INSERT INTO documents (path, title, mtime, content, content_hash) VALUES (` + + `${q(doc.path)}, ${q(doc.title)}, ${q(doc.mtime)}, ${q(doc.content)}, ${q(doc.contentHash)})`, + ); + const inserted = this._db + .prepare(`SELECT id FROM documents WHERE path = ${q(doc.path)}`) + .get(); + if (!inserted) throw new Error('upsertDocument(): row vanished after INSERT'); + return { id: inserted.id, replaced }; + } + + /** + * Insert one chunk row. Embedding must match `this.dim`. + * + * @param {{ documentId: number, ord: number, content: string, embedding: number[] }} chunk + */ + insertChunk({ documentId, ord, content, embedding }) { + if (embedding.length !== this.dim) { + throw new Error( + `insertChunk(): embedding dim ${embedding.length} ≠ schema dim ${this.dim}`, + ); + } + this._db.exec( + `INSERT INTO chunks (document_id, ord, content, embedding) VALUES (` + + `${q(documentId)}, ${q(ord)}, ${q(content)}, ${q(embedding)})`, + ); + } + + /** + * Drop a document and every chunk pointing at it. + * + * @param {number} documentId + */ + deleteDocument(documentId) { + this._db.exec(`DELETE FROM chunks WHERE document_id = ${q(documentId)}`); + this._db.exec(`DELETE FROM documents WHERE id = ${q(documentId)}`); + } + + // ------------------------------------------------------------------ + // Reads + + /** + * Map of path → { id, mtime, content_hash }. 
Used by `refresh` to + * decide which files changed. + * + * @returns {Map} + */ + listDocuments() { + const rows = this._db + .prepare('SELECT id, path, mtime, content_hash FROM documents') + .all(); + const map = new Map(); + for (const r of rows) { + map.set(r.path, { + id: r.id, + mtime: r.mtime, + contentHash: r.content_hash, + }); + } + return map; + } + + /** + * Hybrid top-k search over chunks. Combines BM25 lexical with vector + * cosine in a single `ORDER BY` (see `docs/fts.md`). + * + * If `query` produces no FTS tokens (e.g. a single non-ASCII word), + * we fall back to vector-only ranking — otherwise the FTS pre-filter + * would return an empty set. + * + * @param {{ query: string, embedding: number[], k?: number, weight?: number }} args + * @returns {Array<{ chunk_id: number, document_id: number, path: string, title: string, ord: number, content: string }>} + */ + hybridSearch({ query, embedding, k = 5, weight = 0.5 }) { + if (embedding.length !== this.dim) { + throw new Error( + `hybridSearch(): embedding dim ${embedding.length} ≠ schema dim ${this.dim}`, + ); + } + const tokens = ftsTokenize(query); + const ftsQuery = tokens.join(' '); + const w = clamp01(weight); + + let chunkRows; + if (ftsQuery.length === 0) { + chunkRows = this._db + .prepare( + `SELECT id, document_id, ord, content FROM chunks ` + + `ORDER BY vec_distance_cosine(embedding, ${q(embedding)}) ASC ` + + `LIMIT ${q(k)}`, + ) + .all(); + } else { + // Hybrid: fts_match pre-filters, ORDER BY fuses BM25 + cosine. 
+ chunkRows = this._db + .prepare( + `SELECT id, document_id, ord, content FROM chunks ` + + `WHERE fts_match(content, ${q(ftsQuery)}) ` + + `ORDER BY ${q(w)} * bm25_score(content, ${q(ftsQuery)}) ` + + `+ ${q(1 - w)} * (1.0 - vec_distance_cosine(embedding, ${q(embedding)})) ` + + `DESC LIMIT ${q(k)}`, + ) + .all(); + // If FTS pre-filter happened to find nothing (every token is + // unknown to the index), fall back to vector-only so the agent + // always gets *some* recall to ground on. + if (chunkRows.length === 0) { + chunkRows = this._db + .prepare( + `SELECT id, document_id, ord, content FROM chunks ` + + `ORDER BY vec_distance_cosine(embedding, ${q(embedding)}) ASC ` + + `LIMIT ${q(k)}`, + ) + .all(); + } + } + + return chunkRows.map((row) => { + const doc = this._db + .prepare( + `SELECT path, title FROM documents WHERE id = ${q(row.document_id)}`, + ) + .get(); + return { + chunk_id: row.id, + document_id: row.document_id, + path: doc?.path ?? '', + title: doc?.title ?? '', + ord: row.ord, + content: row.content, + }; + }); + } + + /** + * BM25 top-k over `documents.content` — useful for the debug + * `search --mode=bm25-docs` shape. + * + * @param {string} query + * @param {number} k + */ + bm25DocumentsSearch(query, k = 5) { + const tokens = ftsTokenize(query); + if (tokens.length === 0) return []; + const ftsQuery = tokens.join(' '); + return this._db + .prepare( + `SELECT id, path, title FROM documents ` + + `WHERE fts_match(content, ${q(ftsQuery)}) ` + + `ORDER BY bm25_score(content, ${q(ftsQuery)}) DESC ` + + `LIMIT ${q(k)}`, + ) + .all(); + } + + /** Quick row counts for `stats`. */ + stats() { + const dRow = this._db.prepare('SELECT COUNT(*) AS c FROM documents').get(); + const cRow = this._db.prepare('SELECT COUNT(*) AS c FROM chunks').get(); + return { + documents: Number(dRow?.c ?? 0), + chunks: Number(cRow?.c ?? 
0), + dim: this.dim, + }; + } + + // ------------------------------------------------------------------ + // Transactions + + /** + * Run `fn` inside a single transaction. Commits on success, rolls + * back on any thrown error. Synchronous — the engine is sync. + * + * @template T + * @param {() => T} fn + * @returns {T} + */ + transaction(fn) { + this._db.exec('BEGIN'); + try { + const result = fn(); + this._db.exec('COMMIT'); + return result; + } catch (err) { + try { + this._db.exec('ROLLBACK'); + } catch { + // Ignore — the engine is in an unknown state; surface the + // original error to the caller. + } + throw err; + } + } + + /** Raw escape hatch — used by tests for ad-hoc SQL. */ + raw() { + return this._db; + } + + /** + * Close the underlying engine connection and re-open it at the same + * path. Used by the ingest pipeline to work around the engine's + * HNSW-after-delete bug (see the example's README). After this + * call the wrapper still works exactly as before — only the + * underlying connection is fresh, which forces a clean index + * rebuild on the next read. + * + * @param {{ readOnly?: boolean }} [opts] + */ + reopen(opts = {}) { + if (this.path === ':memory:') { + throw new Error('reopen(): in-memory databases cannot be reopened (state would be lost)'); + } + this._db.close(); + this._db = opts.readOnly + ? Database.openReadOnly(this.path) + : new Database(this.path); + } + + close() { + this._db.close(); + } +} + +// ------------------------------------------------------------------ +// FTS tokenizer mirror. +// +// The engine's FTS tokenizer (docs/fts.md) splits on `[^A-Za-z0-9]+` +// and lowercases. We replicate it in JS so we can pre-check whether a +// query string would yield any tokens — if not, the FTS WHERE clause +// matches nothing and we should fall back to vector-only. 
+ +const TOKEN_RE = /[A-Za-z0-9]+/g; +const STOPWORDS = new Set([ + 'a', 'an', 'and', 'or', 'the', 'is', 'are', 'was', 'were', 'be', 'been', + 'in', 'on', 'at', 'to', 'of', 'for', 'with', 'by', 'as', 'it', 'this', + 'that', 'these', 'those', 'i', 'you', 'we', 'they', 'he', 'she', +]); + +/** + * Tokenize a query the same way the engine's FTS tokenizer would, then + * drop single-character tokens and a tiny stop-list to avoid `fts_match` + * ballooning into a full-table scan on filler words. (The engine has no + * stop list of its own — that's intentional, see `docs/fts.md`. But for + * retrieval we definitely don't want "the" + "is" to drive ranking.) + * + * @param {string} text + * @returns {string[]} + */ +export function ftsTokenize(text) { + if (!text) return []; + const matches = text.match(TOKEN_RE) ?? []; + return matches + .map((t) => t.toLowerCase()) + .filter((t) => t.length > 1 && !STOPWORDS.has(t)); +} + +function clamp01(x) { + if (!Number.isFinite(x)) return 0.5; + if (x < 0) return 0; + if (x > 1) return 1; + return x; +} diff --git a/examples/nodejs-notes/src/embeddings.mjs b/examples/nodejs-notes/src/embeddings.mjs new file mode 100644 index 0000000..bcdb179 --- /dev/null +++ b/examples/nodejs-notes/src/embeddings.mjs @@ -0,0 +1,152 @@ +// Embedding-provider abstractions. +// +// Two providers: +// +// 1. `hash` (default, offline) — a token-bag hash embedder that +// lets users run the whole pipeline without an API key. +// Quality is bag-of-words-ish; good for demos and tests, not +// for production RAG. +// +// 2. `openai` — `text-embedding-3-small`. Pinned to the `dimensions` +// override so we stay at 384 dims for compatibility with the +// schema (and for parity with the python-agent example). +// +// All providers share the same surface: `await embed(text)` returns +// a `number[]` of `provider.dim` items.
+ +import { DEFAULT_EMBEDDING_DIM } from './config.mjs'; + +/** + * @typedef {object} Embedder + * @property {string} name + * @property {number} dim + * @property {(text: string) => Promise} embed + */ + +/** + * Build an embedder by name. Throws if the configuration is invalid + * (e.g. `openai` without `OPENAI_API_KEY`). + * + * @param {{ kind?: string, dim?: number, model?: string, apiKey?: string, fetchFn?: typeof fetch }} opts + * @returns {Embedder} + */ +export function makeEmbedder(opts = {}) { + const kind = opts.kind ?? process.env.SQLRITE_NOTES_EMBEDDER ?? 'hash'; + const dim = opts.dim ?? DEFAULT_EMBEDDING_DIM; + if (kind === 'hash') return makeHashEmbedder(dim); + if (kind === 'openai') { + const apiKey = opts.apiKey ?? process.env.OPENAI_API_KEY; + if (!apiKey) { + throw new Error( + 'openai embedder: set OPENAI_API_KEY (or pass --embedder hash to run offline).', + ); + } + const model = opts.model ?? process.env.SQLRITE_NOTES_OPENAI_MODEL ?? 'text-embedding-3-small'; + return makeOpenAIEmbedder({ + apiKey, + model, + dim, + fetchFn: opts.fetchFn ?? fetch, + }); + } + throw new Error(`unknown embedder kind: ${JSON.stringify(kind)} (expected "hash" or "openai")`); +} + +// ------------------------------------------------------------------ +// Hash embedder +// +// Deterministic, zero-dependency, offline. Maps each lowercased +// alphanumeric token (`[a-z0-9]+`) through a tiny FNV-1a hash into +// one of `dim` slots, counts occurrences per slot, then +// L2-normalizes the result so cosine similarity is meaningful. + +/** + * @param {number} dim + * @returns {Embedder} + */ +export function makeHashEmbedder(dim) { + return { + name: 'hash', + dim, + async embed(text) { + const vec = new Float64Array(dim); + const tokens = (text || '').toLowerCase().match(/[a-z0-9]+/g) ?? []; + for (const tok of tokens) { + const slot = fnv1a32(tok) % dim; + vec[slot] += 1; + } + // L2 normalize (zero-safe).
+ let sumSq = 0; + for (let i = 0; i < vec.length; i++) sumSq += vec[i] * vec[i]; + const norm = Math.sqrt(sumSq); + if (norm === 0) return Array.from(vec); + const out = new Array(dim); + for (let i = 0; i < vec.length; i++) out[i] = vec[i] / norm; + return out; + }, + }; +} + +function fnv1a32(s) { + // Classic FNV-1a 32-bit, returns a non-negative integer. + let h = 0x811c9dc5; + for (let i = 0; i < s.length; i++) { + h ^= s.charCodeAt(i); + h = Math.imul(h, 0x01000193); + } + return h >>> 0; +} + +// ------------------------------------------------------------------ +// OpenAI embedder + +/** + * @param {{ apiKey: string, model: string, dim: number, fetchFn: typeof fetch }} args + * @returns {Embedder} + */ +export function makeOpenAIEmbedder({ apiKey, model, dim, fetchFn }) { + return { + name: `openai/${model}`, + dim, + async embed(text) { + const body = JSON.stringify({ + model, + input: text, + dimensions: dim, + }); + const res = await fetchFn('https://api.openai.com/v1/embeddings', { + method: 'POST', + headers: { + 'content-type': 'application/json', + authorization: `Bearer ${apiKey}`, + }, + body, + }); + if (!res.ok) { + const detail = await safeText(res); + throw new Error( + `OpenAI embeddings API error ${res.status}: ${detail.slice(0, 300)}`, + ); + } + const json = await res.json(); + const vec = json?.data?.[0]?.embedding; + if (!Array.isArray(vec)) { + throw new Error('OpenAI embeddings: malformed response (no data[0].embedding)'); + } + if (vec.length !== dim) { + throw new Error( + `OpenAI embeddings: returned ${vec.length} dims, expected ${dim}`, + ); + } + return vec; + }, + }; +} + +async function safeText(res) { + try { + return await res.text(); + } catch { + return ''; + } +} diff --git a/examples/nodejs-notes/src/ingest.mjs b/examples/nodejs-notes/src/ingest.mjs new file mode 100644 index 0000000..2c4cfcf --- /dev/null +++ b/examples/nodejs-notes/src/ingest.mjs @@ -0,0 +1,249 @@ +// Markdown → SQLRite ingest pipeline. 
+// +// Walks a directory of `.md` / `.markdown` files, chunks each one, +// embeds every chunk, and writes documents + chunks into the DB. +// Two entry points: +// +// - `ingest(...)` — full reindex of a directory. Used by `init`. +// - `refresh(...)` — incremental: skip files whose mtime + content +// hash haven't changed since the last run. Used by `refresh`. +// +// Both flow through `ingestImpl`, which splits the work into three +// phases: PLAN (read-only diff against the current DB) → DELETE (drop +// stale documents/chunks; close + reopen the DB) → INSERT (write new +// rows). The close/reopen between DELETE and INSERT is a workaround +// for an engine bug where the HNSW chunk index panics when rows are +// deleted and re-inserted in the same connection lifetime — see the +// "Known limitations" section of this example's README. + +import { readFile, readdir, stat } from 'node:fs/promises'; +import { createHash } from 'node:crypto'; +import { join, relative, basename, extname } from 'node:path'; + +import { + stripFrontmatter, + deriveTitle, + chunkMarkdown, +} from './chunker.mjs'; +import { DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP } from './config.mjs'; + +/** + * @typedef {object} IngestStats + * @property {number} files + * @property {number} chunks + * @property {number} skipped + * @property {number} deleted + * @property {number} elapsedMs + */ + +/** + * Find every markdown file under `root` (recursive). Ignores hidden + * directories (`.git`, `.obsidian`, etc.) and `node_modules` so a + * dropped-in Obsidian vault doesn't suck in junk. + * + * @param {string} root + * @returns {Promise} + */ +export async function findMarkdownFiles(root) { + const out = []; + await walk(root, out); + out.sort(); + return out; +} + +async function walk(dir, out) { + let entries; + try { + entries = await readdir(dir, { withFileTypes: true }); + } catch { + return; // root may not exist; let the caller surface the message. 
+ } + for (const ent of entries) { + const full = join(dir, ent.name); + if (ent.isDirectory()) { + if (ent.name.startsWith('.') || ent.name === 'node_modules') continue; + await walk(full, out); + continue; + } + if (!ent.isFile()) continue; + const ext = extname(ent.name).toLowerCase(); + if (ext === '.md' || ext === '.markdown') out.push(full); + } +} + +/** + * Re-ingest every file under `root` — replaces any existing rows for + * the same path. Use for the `init` flow. + * + * @param {{ db: NotesDB, root: string, embedder: import('./embeddings.mjs').Embedder, logger?: (s: string) => void, chunkOpts?: { targetTokens?: number, overlapTokens?: number } }} args + * @returns {Promise} + */ +export async function ingest(args) { + return ingestImpl({ ...args, mode: 'full' }); +} + +/** + * Incremental re-ingest. Skips files whose mtime + content hash + * matches what's already in the DB. Deletes documents whose file is + * gone from disk. + * + * @param {{ db: NotesDB, root: string, embedder: import('./embeddings.mjs').Embedder, logger?: (s: string) => void, chunkOpts?: { targetTokens?: number, overlapTokens?: number } }} args + * @returns {Promise} + */ +export async function refresh(args) { + return ingestImpl({ ...args, mode: 'incremental' }); +} + +/** + * @param {{ db: NotesDB, root: string, embedder: import('./embeddings.mjs').Embedder, logger?: (s: string) => void, chunkOpts?: { targetTokens?: number, overlapTokens?: number }, mode: 'full' | 'incremental' }} args + * @returns {Promise} + */ +async function ingestImpl({ db, root, embedder, logger, chunkOpts, mode }) { + const log = logger ?? (() => {}); + const t0 = Date.now(); + const target = chunkOpts?.targetTokens ?? DEFAULT_CHUNK_TOKENS; + const overlap = chunkOpts?.overlapTokens ?? 
+ DEFAULT_CHUNK_OVERLAP; + + const files = await findMarkdownFiles(root); + if (files.length === 0) { + log(`no markdown files found under ${root}`); + return { files: 0, chunks: 0, skipped: 0, deleted: 0, elapsedMs: 0 }; + } + + // ---------------------------------------------------------------- + // PHASE 1 — plan. Read the current DB state, hash each on-disk + // file, build the change set. No writes yet. + const existing = db.listDocuments(); + /** @type {Array<{ relPath: string, abs: string, mtime: number, text: string, hash: string, priorId: number | null }>} */ + const planUpserts = []; + /** @type {number[]} */ + const planDeletes = []; + let skipped = 0; + const seenPaths = new Set(); + + for (const abs of files) { + const rel = relative(root, abs); + const text = await readFile(abs, 'utf8'); + const fstat = await stat(abs); + const mtime = Math.floor(fstat.mtimeMs / 1000); + const hash = sha256Hex(text); + seenPaths.add(rel); + const prior = existing.get(rel); + + if (mode === 'incremental' && prior && prior.mtime === mtime && prior.contentHash === hash) { + skipped++; + continue; + } + planUpserts.push({ + relPath: rel, + abs, + mtime, + text, + hash, + priorId: prior?.id ?? null, + }); + } + // Documents whose file vanished from disk are dropped in both + // modes. For `refresh` this is the incremental delete; for a full + // ingest it means a re-run of `init` against a different source + // dir doesn't leave orphans behind. (Docs that are being + // re-ingested are replaced separately via their `priorId`.) + for (const [path, prior] of existing) { + if (!seenPaths.has(path)) planDeletes.push(prior.id); + } + + // Embed BEFORE touching the DB. If anything throws here (e.g. a + // network embedding call fails) we haven't mutated anything.
+ /** @type {Array<{ plan: typeof planUpserts[number], title: string, body: string, chunks: Array<{ ord: number, content: string, embedding: number[] }> }>} */ + const embedded = []; + let totalEmbedded = 0; + for (const p of planUpserts) { + const { frontmatter, body } = stripFrontmatter(p.text); + const title = deriveTitle({ + frontmatter, + body, + fallback: basename(p.abs, extname(p.abs)), + }); + const chunks = chunkMarkdown(body, { targetTokens: target, overlapTokens: overlap }); + if (chunks.length === 0) { + log(`skipped empty: ${p.relPath}`); + continue; + } + const embeds = []; + for (const c of chunks) { + const v = await embedder.embed(c.content); + embeds.push({ ord: c.ord, content: c.content, embedding: v }); + totalEmbedded++; + } + embedded.push({ plan: p, title, body, chunks: embeds }); + if (embedded.length % 10 === 0) { + log(` embedded ${embedded.length}/${planUpserts.length} files (${totalEmbedded} chunks)…`); + } + } + + const hasMutations = planDeletes.length > 0 || embedded.some((e) => e.plan.priorId !== null); + + // ---------------------------------------------------------------- + // PHASE 2 — deletes (and replacing-deletes). + // + // The engine's HNSW index has a bug where rows deleted and re- + // inserted within the same connection lifetime can corrupt the + // index's stored vectors (see ../README.md "Known limitations"). + // Closing + reopening the connection between the delete-pass and + // the insert-pass forces a full index rebuild on next open, + // sidestepping the issue. We only pay this cost when there's + // actually something to delete; pure-INSERT runs (first `init`) + // skip this hop entirely. + if (hasMutations) { + db.transaction(() => { + for (const id of planDeletes) db.deleteDocument(id); + for (const e of embedded) { + if (e.plan.priorId !== null) db.deleteDocument(e.plan.priorId); + } + }); + db.reopen(); + } + + // ---------------------------------------------------------------- + // PHASE 3 — inserts. 
+ let totalChunks = 0; + for (const e of embedded) { + db.transaction(() => { + const { id } = db.upsertDocument({ + path: e.plan.relPath, + title: e.title, + mtime: e.plan.mtime, + content: e.body, + contentHash: e.plan.hash, + }); + for (const c of e.chunks) { + db.insertChunk({ + documentId: id, + ord: c.ord, + content: c.content, + embedding: c.embedding, + }); + } + }); + totalChunks += e.chunks.length; + } + + return { + files: embedded.length, + chunks: totalChunks, + skipped, + deleted: planDeletes.length, + elapsedMs: Date.now() - t0, + }; +} + +function sha256Hex(input) { + return createHash('sha256').update(input).digest('hex'); +} diff --git a/examples/nodejs-notes/src/search.mjs b/examples/nodejs-notes/src/search.mjs new file mode 100644 index 0000000..650acf1 --- /dev/null +++ b/examples/nodejs-notes/src/search.mjs @@ -0,0 +1,51 @@ +// Hybrid retrieval driver for the `search` debug command. +// +// Same shape an LLM would get over MCP through `vector_search` + +// `bm25_search`, but with rendered output for humans. + +/** + * @param {{ db: import('./db.mjs').NotesDB, embedder: import('./embeddings.mjs').Embedder, query: string, k?: number, weight?: number }} args + */ +export async function search({ db, embedder, query, k = 5, weight = 0.5 }) { + const embedding = await embedder.embed(query); + return db.hybridSearch({ query, embedding, k, weight }); +} + +/** + * Render a list of search results as a human-friendly string. + * + * @param {string} query + * @param {ReturnType} hits + */ +export function renderResults(query, hits) { + if (hits.length === 0) { + return `no results for: ${JSON.stringify(query)}\n`; + } + const lines = []; + lines.push(`top ${hits.length} hits for: ${JSON.stringify(query)}`); + lines.push(''); + for (let i = 0; i < hits.length; i++) { + const h = hits[i]; + const head = h.title ? `${h.title} — ${h.path}` : h.path; + lines.push(`${pad(i + 1)}. 
${head} (chunk ${h.ord})`); + lines.push(indent(truncate(h.content, 280))); + lines.push(''); + } + return lines.join('\n'); +} + +function pad(n) { + return String(n).padStart(2, ' '); +} + +function indent(text) { + return text + .split(/\r?\n/) + .map((l) => ` ${l}`) + .join('\n'); +} + +function truncate(text, max) { + const t = text.replace(/\s+/g, ' ').trim(); + return t.length <= max ? t : `${t.slice(0, max - 1)}…`; +} diff --git a/examples/nodejs-notes/src/serve.mjs b/examples/nodejs-notes/src/serve.mjs new file mode 100644 index 0000000..43247f9 --- /dev/null +++ b/examples/nodejs-notes/src/serve.mjs @@ -0,0 +1,113 @@ +// `serve`: spawn `sqlrite-mcp <dbPath> --read-only` with stdio inherited. +// +// The point of this command is to remove the "find the binary, then +// write the right `args` array" step from Claude Desktop config: +// users wire ONE thing (`sqlrite-notes serve`) and never have to +// know where `sqlrite-mcp` lives. The MCP client speaks JSON-RPC +// over our stdio; we just shovel it to/from the child. + +import { spawn } from 'node:child_process'; +import { existsSync } from 'node:fs'; +import { join } from 'node:path'; +import { homedir } from 'node:os'; + +/** + * Try a sequence of well-known locations to find a `sqlrite-mcp` + * binary. Order: + * + * 1. `SQLRITE_MCP_BIN` env var (explicit override). + * 2. The first `sqlrite-mcp` found by scanning the `PATH` directories. + * 3. `~/.cargo/bin/sqlrite-mcp` (cargo install default). + * + * @returns {string | null} + */ +export function locateMcpBinary() { + const env = process.env.SQLRITE_MCP_BIN; + if (env) { + if (!existsSync(env)) { + throw new Error( + `SQLRITE_MCP_BIN=${env} is set but the file doesn't exist.`, + ); + } + return env; + } + + // PATH lookup. `process.env.PATH` is the only thing we can portably + // check without shelling out; spawning `which` adds latency for no + // benefit since `spawn(name)` will already use PATH on Unix. + const pathDirs = (process.env.PATH ?? '').split(process.platform === 'win32' ? 
';' : ':'); + const exeName = process.platform === 'win32' ? 'sqlrite-mcp.exe' : 'sqlrite-mcp'; + for (const dir of pathDirs) { + if (!dir) continue; + const candidate = join(dir, exeName); + if (existsSync(candidate)) return candidate; + } + + // Cargo install fallback. + const cargoBin = join(homedir(), '.cargo', 'bin', exeName); + if (existsSync(cargoBin)) return cargoBin; + + return null; +} + +/** + * Spawn `sqlrite-mcp <dbPath> --read-only` with stdio inherited. Returns + * a Promise that resolves with the child's exit code. + * + * @param {{ dbPath: string, extraArgs?: string[], stderr?: NodeJS.WritableStream }} args + * @returns {Promise<number>} + */ +export function spawnMcpServer({ dbPath, extraArgs = [], stderr }) { + const bin = locateMcpBinary(); + if (!bin) { + throw new Error( + 'sqlrite-mcp binary not found.\n' + + '\n' + + 'Install it one of these ways:\n' + + ' cargo install sqlrite-mcp\n' + + ' # or download from https://github.com/joaoh82/rust_sqlite/releases\n' + + '\n' + + 'You can also override the lookup with SQLRITE_MCP_BIN=/path/to/sqlrite-mcp.\n', + ); + } + + // Build args. `--read-only` is the whole reason this wrapper exists: + // we never want Claude (or any other MCP client) to mutate the notes + // DB out from under the ingest pipeline. + const args = [dbPath, '--read-only', ...extraArgs]; + + return new Promise((resolve, reject) => { + const child = spawn(bin, args, { + // stdin / stdout MUST be inherited so the MCP client can talk to + // the child directly. stderr we pipe to wherever the caller asks + // (default: our own stderr). + stdio: ['inherit', 'inherit', stderr ? 'pipe' : 'inherit'], + env: process.env, + }); + if (stderr && child.stderr) { + child.stderr.pipe(stderr); + } + child.on('error', reject); + child.on('exit', (code, signal) => { + if (signal) { + // Propagate the signal as a non-zero exit code so Claude + // Desktop sees the failure cleanly. + resolve(128 + (signalToNumber(signal) ?? 1)); + } else { + resolve(code ?? 
0); + } + }); + // Forward SIGINT / SIGTERM to the child so Ctrl-C in the parent + // shuts the child down rather than orphaning it. + const forward = (sig) => { + if (!child.killed) child.kill(sig); + }; + process.once('SIGINT', () => forward('SIGINT')); + process.once('SIGTERM', () => forward('SIGTERM')); + }); +} + +function signalToNumber(sig) { + const map = { SIGINT: 2, SIGTERM: 15, SIGKILL: 9, SIGHUP: 1 }; + return map[sig]; +} diff --git a/examples/nodejs-notes/src/sqlutil.mjs b/examples/nodejs-notes/src/sqlutil.mjs new file mode 100644 index 0000000..33aa1ba --- /dev/null +++ b/examples/nodejs-notes/src/sqlutil.mjs @@ -0,0 +1,63 @@ +// Tiny SQL-literal helpers — the SQLRite engine doesn't support +// `?`-style parameter binding yet (Phase 5a.2 follow-up), so every +// caller must inline values as SQL literals. This module is the +// single place that does that safely. +// +// Mirrors the shape of `sqlrite_agent.sqlutil` in the Python example. + +/** + * Quote a JavaScript value as a SQL literal. + * + * - string → `'escaped'` (single quotes doubled per the SQL standard) + * - number → integer or `Number.prototype.toString()` for finite floats + * - boolean → `TRUE` / `FALSE` + * - null/undefined → `NULL` + * - number[] → `[v1, v2, ...]` — the engine's vector literal syntax + * + * Anything else throws — refuse to silently `String()` an object. + * + * @param {unknown} value + * @returns {string} + */ +export function q(value) { + if (value === null || value === undefined) return 'NULL'; + if (typeof value === 'string') return `'${value.replaceAll("'", "''")}'`; + if (typeof value === 'number') { + if (!Number.isFinite(value)) { + throw new TypeError(`q(): non-finite number ${value}`); + } + return Number.isInteger(value) ? String(value) : value.toString(); + } + if (typeof value === 'bigint') return value.toString(); + if (typeof value === 'boolean') return value ? 
'TRUE' : 'FALSE'; + if (Array.isArray(value)) { + // Vector literal — every element must be finite numeric. + const parts = value.map((v, i) => { + if (typeof v !== 'number' || !Number.isFinite(v)) { + throw new TypeError(`q(): vector element ${i} is not a finite number (got ${v})`); + } + // toString() emits the shortest round-trippable form; the + // engine's parser accepts both fixed-point and exponential. + return v.toString(); + }); + return `[${parts.join(', ')}]`; + } + throw new TypeError(`q(): unsupported value type ${typeof value}`); +} + +/** + * Validate a SQL identifier (table / column / index name) against the + * unquoted-identifier subset the engine accepts. Throws if invalid. + * + * Use this for ANY identifier that ultimately gets inlined into SQL — + * callers shouldn't have to guess what's safe. + * + * @param {string} name + * @returns {string} the same name (for chaining) + */ +export function ident(name) { + if (typeof name !== 'string' || !/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) { + throw new TypeError(`ident(): invalid SQL identifier ${JSON.stringify(name)}`); + } + return name; +} diff --git a/examples/nodejs-notes/test/chunker.test.mjs b/examples/nodejs-notes/test/chunker.test.mjs new file mode 100644 index 0000000..d6faac2 --- /dev/null +++ b/examples/nodejs-notes/test/chunker.test.mjs @@ -0,0 +1,85 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; + +import { + stripFrontmatter, + deriveTitle, + chunkMarkdown, + approxTokens, +} from '../src/chunker.mjs'; + +test('stripFrontmatter — YAML between --- fences', () => { + const text = '---\ntitle: Foo\ntags: [a]\n---\n\nBody.'; + const { frontmatter, body } = stripFrontmatter(text); + assert.match(frontmatter, /^title: Foo/); + assert.equal(body.trim(), 'Body.'); +}); + +test('stripFrontmatter — no frontmatter passes through', () => { + const text = '# Heading\n\nBody.'; + const { frontmatter, body } = stripFrontmatter(text); + assert.equal(frontmatter, ''); + 
assert.equal(body, text); +}); + +test('deriveTitle — picks frontmatter title first', () => { + assert.equal( + deriveTitle({ frontmatter: 'title: Hello World', body: '# Other', fallback: 'fb' }), + 'Hello World', + ); +}); + +test('deriveTitle — falls back to first heading', () => { + assert.equal( + deriveTitle({ frontmatter: '', body: '# My Heading\n\nbody.', fallback: 'fb' }), + 'My Heading', + ); +}); + +test('deriveTitle — falls back to filename stem', () => { + assert.equal(deriveTitle({ frontmatter: '', body: 'no heading', fallback: 'fb' }), 'fb'); +}); + +test('approxTokens — whitespace word count', () => { + assert.equal(approxTokens(''), 0); + assert.equal(approxTokens('one two three'), 3); + assert.equal(approxTokens(' a b '), 2); +}); + +test('chunkMarkdown — single short doc fits in one chunk', () => { + const out = chunkMarkdown('# Title\n\nA short paragraph.\n'); + assert.equal(out.length, 1); + assert.match(out[0].content, /Title/); + assert.match(out[0].content, /short paragraph/); +}); + +test('chunkMarkdown — long doc splits into multiple chunks', () => { + const big = Array.from({ length: 20 }, (_, i) => `Paragraph ${i}: ${'word '.repeat(60)}`).join( + '\n\n', + ); + const out = chunkMarkdown(`# Heading\n\n${big}`, { targetTokens: 200, overlapTokens: 20 }); + assert.ok(out.length > 1, `expected >1 chunk, got ${out.length}`); + // Each chunk should be non-empty. + for (const c of out) { + assert.ok(c.content.length > 0); + } +}); + +test('chunkMarkdown — overlap copies tail tokens forward', () => { + const body = + '# A\n\n' + + Array.from({ length: 5 }, (_, i) => `alpha${i} ${'lorem '.repeat(80)}`).join('\n\n'); + const out = chunkMarkdown(body, { targetTokens: 100, overlapTokens: 30 }); + assert.ok(out.length >= 2); + // The second chunk should contain the tail of the first. 
+ const firstTail = out[0].content.split(/\s+/).slice(-20).join(' '); + assert.ok( + out[1].content.includes(firstTail.split(' ').slice(-5).join(' ')), + 'second chunk should overlap with the tail of the first', + ); +}); + +test('chunkMarkdown — empty body produces no chunks', () => { + assert.deepEqual(chunkMarkdown(''), []); + assert.deepEqual(chunkMarkdown(' \n\n '), []); +}); diff --git a/examples/nodejs-notes/test/claude-config.test.mjs b/examples/nodejs-notes/test/claude-config.test.mjs new file mode 100644 index 0000000..d5e7217 --- /dev/null +++ b/examples/nodejs-notes/test/claude-config.test.mjs @@ -0,0 +1,36 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; + +import { buildConfig, renderInstructions } from '../src/claude-config.mjs'; + +test('buildConfig — default command is "sqlrite-notes"', () => { + const cfg = buildConfig({ dbPath: '/tmp/notes.sqlrite' }); + assert.deepEqual(cfg, { + mcpServers: { + 'sqlrite-notes': { + command: 'sqlrite-notes', + args: ['serve', '--db', '/tmp/notes.sqlrite'], + }, + }, + }); +}); + +test('buildConfig — explicit binPath wins', () => { + const cfg = buildConfig({ + dbPath: '/tmp/notes.sqlrite', + binPath: '/opt/sqlrite-notes/bin/sqlrite-notes.mjs', + }); + assert.equal( + cfg.mcpServers['sqlrite-notes'].command, + '/opt/sqlrite-notes/bin/sqlrite-notes.mjs', + ); +}); + +test('renderInstructions — embeds JSON block and Claude Desktop path', () => { + const out = renderInstructions({ dbPath: '/tmp/notes.sqlrite' }); + assert.match(out, /mcpServers/); + assert.match(out, /sqlrite-notes/); + assert.match(out, /"command": "sqlrite-notes"/); + assert.match(out, /serve/); + assert.match(out, /modelcontextprotocol/); // inspector hint +}); diff --git a/examples/nodejs-notes/test/db.test.mjs b/examples/nodejs-notes/test/db.test.mjs new file mode 100644 index 0000000..210dddf --- /dev/null +++ b/examples/nodejs-notes/test/db.test.mjs @@ -0,0 +1,272 @@ +// Integration tests against a real SQLRite 
Connection. These require +// the @joaoh82/sqlrite Node binding to be built/installed; if it +// isn't, the suite emits a skip notice instead of failing. + +import test from 'node:test'; +import assert from 'node:assert/strict'; +import { mkdtempSync, rmSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; + +let NotesDB; +let skipReason = null; +try { + ({ NotesDB } = await import('../src/db.mjs')); +} catch (err) { + skipReason = `cannot import db.mjs (build the Node SDK first?): ${err.message}`; +} + +// Node 24's test runner treats `{ skip: null }` as a skip directive +// (the key's presence matters more than its value), so use this +// helper to conditionally pass the option only when we genuinely +// want to skip. +const maybeSkip = skipReason ? { skip: skipReason } : {}; + +function withDb(fn) { + const dir = mkdtempSync(join(tmpdir(), 'sqlrite-notes-test-')); + const path = join(dir, 'notes.sqlrite'); + try { + return fn({ dir, path }); + } finally { + rmSync(dir, { recursive: true, force: true }); + } +} + +test('db: schema applies cleanly + stats start at zero', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 16 }); + try { + const s = db.stats(); + assert.equal(s.documents, 0); + assert.equal(s.chunks, 0); + assert.equal(s.dim, 16); + } finally { + db.close(); + } + }); +}); + +test('db: upsertDocument + insertChunk round-trip', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const { id, replaced } = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 100, + content: 'rust embedded database notes', + contentHash: 'h1', + }); + assert.ok(id > 0); + assert.equal(replaced, false); + + db.insertChunk({ + documentId: id, + ord: 0, + content: 'rust embedded database notes', + embedding: [1, 0, 0, 0], + }); + + const s = db.stats(); + assert.equal(s.documents, 1); + assert.equal(s.chunks, 1); + } finally { + db.close(); + } + }); 
+}); + +test('db: upsertDocument replaces prior version + cascades chunks', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const v1 = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 100, + content: 'one', + contentHash: 'h1', + }); + db.insertChunk({ + documentId: v1.id, + ord: 0, + content: 'one', + embedding: [1, 0, 0, 0], + }); + + const v2 = db.upsertDocument({ + path: 'a.md', + title: 'A v2', + mtime: 200, + content: 'two', + contentHash: 'h2', + }); + assert.equal(v2.replaced, true); + assert.notEqual(v2.id, v1.id); + + const s = db.stats(); + assert.equal(s.documents, 1); + assert.equal(s.chunks, 0); // old chunk got dropped on replace + } finally { + db.close(); + } + }); +}); + +test( + 'db: hybridSearch returns vector + BM25 hits in a sensible order', + maybeSkip, + () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const { id: dA } = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 1, + content: 'rust embedded database', + contentHash: 'h1', + }); + const { id: dB } = db.upsertDocument({ + path: 'b.md', + title: 'B', + mtime: 2, + content: 'distributed systems and consensus protocols', + contentHash: 'h2', + }); + // One chunk per document, with distinct embeddings. + db.insertChunk({ + documentId: dA, + ord: 0, + content: 'rust embedded database', + embedding: [1, 0, 0, 0], + }); + db.insertChunk({ + documentId: dB, + ord: 0, + content: 'distributed systems and consensus protocols', + embedding: [0, 1, 0, 0], + }); + + // A query whose embedding aligns with chunk A and whose + // tokens overlap chunk A — A should win. 
+ const hits = db.hybridSearch({ + query: 'rust database', + embedding: [1, 0, 0, 0], + k: 2, + }); + assert.ok(hits.length >= 1); + assert.equal(hits[0].path, 'a.md'); + } finally { + db.close(); + } + }); + }, +); + +test( + 'db: hybridSearch falls back to vector-only when FTS tokens are empty', + maybeSkip, + () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const { id } = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 1, + content: 'rust embedded database', + contentHash: 'h1', + }); + db.insertChunk({ + documentId: id, + ord: 0, + content: 'rust embedded database', + embedding: [1, 0, 0, 0], + }); + const hits = db.hybridSearch({ + query: '日本語', // every byte non-ASCII → no FTS tokens + embedding: [1, 0, 0, 0], + k: 5, + }); + assert.equal(hits.length, 1); + assert.equal(hits[0].path, 'a.md'); + } finally { + db.close(); + } + }); + }, +); + +test('db: deleteDocument cascades to chunks', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const { id } = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 1, + content: 'x', + contentHash: 'h', + }); + db.insertChunk({ + documentId: id, + ord: 0, + content: 'x', + embedding: [1, 0, 0, 0], + }); + db.deleteDocument(id); + const s = db.stats(); + assert.equal(s.documents, 0); + assert.equal(s.chunks, 0); + } finally { + db.close(); + } + }); +}); + +test('db: listDocuments → path map', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 100, + content: 'x', + contentHash: 'h1', + }); + db.upsertDocument({ + path: 'b.md', + title: 'B', + mtime: 200, + content: 'y', + contentHash: 'h2', + }); + const map = db.listDocuments(); + assert.equal(map.size, 2); + assert.equal(map.get('a.md')?.mtime, 100); + assert.equal(map.get('b.md')?.contentHash, 'h2'); + } finally { + db.close(); + } + }); +}); + +test('db: re-open 
path-backed DB reads back data', maybeSkip, () => { + withDb(({ path }) => { + const a = new NotesDB(path, { dim: 4 }); + a.upsertDocument({ path: 'a.md', title: 'A', mtime: 1, content: 'x', contentHash: 'h' }); + a.close(); + + const b = new NotesDB(path, { dim: 4 }); + try { + const s = b.stats(); + assert.equal(s.documents, 1); + const map = b.listDocuments(); + assert.equal(map.get('a.md')?.mtime, 1); + } finally { + b.close(); + } + }); +}); diff --git a/examples/nodejs-notes/test/embeddings.test.mjs b/examples/nodejs-notes/test/embeddings.test.mjs new file mode 100644 index 0000000..7612123 --- /dev/null +++ b/examples/nodejs-notes/test/embeddings.test.mjs @@ -0,0 +1,104 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; + +import { makeHashEmbedder, makeOpenAIEmbedder, makeEmbedder } from '../src/embeddings.mjs'; + +test('hash embedder — deterministic, unit-norm, requested dim', async () => { + const emb = makeHashEmbedder(384); + const a = await emb.embed('rust embedded database'); + const b = await emb.embed('rust embedded database'); + assert.equal(a.length, 384); + assert.deepEqual(a, b); + const norm = Math.sqrt(a.reduce((s, x) => s + x * x, 0)); + assert.ok(Math.abs(norm - 1) < 1e-9 || norm === 0, `unit norm, got ${norm}`); +}); + +test('hash embedder — different inputs produce different vectors', async () => { + const emb = makeHashEmbedder(64); + const a = await emb.embed('alpha beta gamma'); + const b = await emb.embed('delta epsilon zeta'); + assert.notDeepEqual(a, b); +}); + +test('hash embedder — empty text returns zero vector', async () => { + const emb = makeHashEmbedder(32); + const a = await emb.embed(''); + assert.equal(a.length, 32); + assert.ok(a.every((x) => x === 0)); +}); + +test('makeEmbedder defaults to hash', () => { + const emb = makeEmbedder({ dim: 16 }); + assert.equal(emb.name, 'hash'); + assert.equal(emb.dim, 16); +}); + +test('makeEmbedder openai without API key throws clear error', () => { + const prev 
= process.env.OPENAI_API_KEY; + delete process.env.OPENAI_API_KEY; + try { + assert.throws( + () => makeEmbedder({ kind: 'openai', dim: 384 }), + /OPENAI_API_KEY/, + ); + } finally { + if (prev !== undefined) process.env.OPENAI_API_KEY = prev; + } +}); + +test('makeEmbedder unknown kind throws', () => { + assert.throws(() => makeEmbedder({ kind: 'word2vec', dim: 8 }), /unknown embedder/); +}); + +test('openai embedder talks to a mocked fetch and validates shape', async () => { + const calls = []; + const fakeFetch = async (url, init) => { + calls.push({ url, init }); + return new Response( + JSON.stringify({ data: [{ embedding: new Array(8).fill(0.5) }] }), + { status: 200, headers: { 'content-type': 'application/json' } }, + ); + }; + const emb = makeOpenAIEmbedder({ + apiKey: 'sk-test', + model: 'text-embedding-3-small', + dim: 8, + fetchFn: fakeFetch, + }); + const v = await emb.embed('hello world'); + assert.equal(v.length, 8); + assert.equal(calls.length, 1); + assert.equal(calls[0].url, 'https://api.openai.com/v1/embeddings'); + const body = JSON.parse(calls[0].init.body); + assert.equal(body.model, 'text-embedding-3-small'); + assert.equal(body.dimensions, 8); + assert.equal(body.input, 'hello world'); + assert.match(calls[0].init.headers.authorization, /^Bearer /); +}); + +test('openai embedder surfaces API errors', async () => { + const fakeFetch = async () => + new Response('rate limited', { status: 429 }); + const emb = makeOpenAIEmbedder({ + apiKey: 'sk-test', + model: 'm', + dim: 4, + fetchFn: fakeFetch, + }); + await assert.rejects(emb.embed('x'), /OpenAI embeddings API error 429/); +}); + +test('openai embedder rejects wrong-dim responses', async () => { + const fakeFetch = async () => + new Response(JSON.stringify({ data: [{ embedding: [1, 2, 3] }] }), { + status: 200, + headers: { 'content-type': 'application/json' }, + }); + const emb = makeOpenAIEmbedder({ + apiKey: 'sk-test', + model: 'm', + dim: 8, + fetchFn: fakeFetch, + }); + await 
assert.rejects(emb.embed('x'), /returned 3 dims, expected 8/); +}); diff --git a/examples/nodejs-notes/test/fixtures/crdts.md b/examples/nodejs-notes/test/fixtures/crdts.md new file mode 100644 index 0000000..72b5a89 --- /dev/null +++ b/examples/nodejs-notes/test/fixtures/crdts.md @@ -0,0 +1,23 @@ +--- +title: CRDTs for collaborative editing +tags: [crdt, distributed-systems] +--- + +# CRDTs for collaborative editing + +Conflict-free replicated data types let two clients edit the same +document offline and merge the result deterministically. Two flavors +matter: state-based (CvRDTs) and operation-based (CmRDTs). + +## When to reach for one + +If your network is unreliable but you can ship every state mutation +through a message broker, CmRDTs win: smaller payloads, fewer wasted +bytes. + +## What still bites you + +Causality tracking — vector clocks specifically — grows with the +number of replicas. Yjs and Automerge invest a lot of effort in +compressing those metadata structures so they don't dominate the on- +wire payload for long-lived documents. diff --git a/examples/nodejs-notes/test/fixtures/postgres.md b/examples/nodejs-notes/test/fixtures/postgres.md new file mode 100644 index 0000000..76bfd8e --- /dev/null +++ b/examples/nodejs-notes/test/fixtures/postgres.md @@ -0,0 +1,18 @@ +# Notes on Postgres + +Postgres is a relational database server with extension hooks for +storage formats and access methods. The reason it's the default +SQL engine for new projects is the combination of MVCC, +PL/pgSQL, and a permissive license. + +## What I keep forgetting + +Subtransactions are cheap up to a point, then VERY expensive — the +SLRU buffers become the bottleneck. If you find yourself with +nested savepoints in a hot path, audit them. + +## Replication + +Streaming replication via WAL shipping is the default. Logical +replication via decoded WAL records is more flexible but the +publication / subscription dance has more moving parts. 
diff --git a/examples/nodejs-notes/test/fixtures/running.md b/examples/nodejs-notes/test/fixtures/running.md new file mode 100644 index 0000000..2729fcc --- /dev/null +++ b/examples/nodejs-notes/test/fixtures/running.md @@ -0,0 +1,8 @@ +# Marathon training journal + +Week 1: did a 10-mile long run on Sunday. Tempo run on Wednesday felt +flat — probably under-fuelled. Keep an eye on carb intake the day +before a quality session. + +Week 2: hill repeats on Tuesday went well. The 16-mile long run was +the first one with no GI issues since switching gels. diff --git a/examples/nodejs-notes/test/ingest.test.mjs b/examples/nodejs-notes/test/ingest.test.mjs new file mode 100644 index 0000000..c356477 --- /dev/null +++ b/examples/nodejs-notes/test/ingest.test.mjs @@ -0,0 +1,131 @@ +// End-to-end ingest + search test against the test/fixtures notes. +// Skips cleanly if the @joaoh82/sqlrite Node binding isn't built. + +import test from 'node:test'; +import assert from 'node:assert/strict'; +import { + mkdtempSync, + mkdirSync, + rmSync, + writeFileSync, + utimesSync, + unlinkSync, +} from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join, dirname } from 'node:path'; +import { fileURLToPath } from 'node:url'; + +import { makeHashEmbedder } from '../src/embeddings.mjs'; + +let NotesDB, ingest, refresh, search; +let skipReason = null; +try { + ({ NotesDB } = await import('../src/db.mjs')); + ({ ingest, refresh } = await import('../src/ingest.mjs')); + ({ search } = await import('../src/search.mjs')); +} catch (err) { + skipReason = `cannot import (build the Node SDK first?): ${err.message}`; +} + +const maybeSkip = skipReason ? 
{ skip: skipReason } : {}; + +const here = dirname(fileURLToPath(import.meta.url)); +const fixturesDir = join(here, 'fixtures'); + +test( + 'ingest fixtures → search recalls the right note for each query', + maybeSkip, + async () => { + const dir = mkdtempSync(join(tmpdir(), 'sqlrite-notes-itest-')); + const path = join(dir, 'notes.sqlrite'); + try { + const embedder = makeHashEmbedder(64); + const db = new NotesDB(path, { dim: embedder.dim }); + try { + const stats = await ingest({ db, root: fixturesDir, embedder }); + assert.ok(stats.files >= 3, `expected ≥3 files, got ${stats.files}`); + assert.ok(stats.chunks >= 3, `expected ≥3 chunks, got ${stats.chunks}`); + + const crdtHits = await search({ + db, + embedder, + query: 'collaborative editing CRDT', + k: 3, + }); + assert.ok(crdtHits.length > 0); + assert.equal(crdtHits[0].path, 'crdts.md'); + + const pgHits = await search({ + db, + embedder, + query: 'WAL replication', + k: 3, + }); + assert.ok(pgHits.length > 0); + assert.equal(pgHits[0].path, 'postgres.md'); + + const runHits = await search({ + db, + embedder, + query: 'marathon training long run', + k: 3, + }); + assert.ok(runHits.length > 0); + assert.equal(runHits[0].path, 'running.md'); + } finally { + db.close(); + } + } finally { + rmSync(dir, { recursive: true, force: true }); + } + }, +); + +test( + 'refresh: unchanged files are skipped; changed files re-embedded; deleted removed', + maybeSkip, + async () => { + const dir = mkdtempSync(join(tmpdir(), 'sqlrite-notes-itest-')); + const sourceDir = join(dir, 'notes'); + const dbPath = join(dir, 'notes.sqlrite'); + try { + mkdirSync(sourceDir, { recursive: true }); + writeFileSync(join(sourceDir, 'keep.md'), '# Keep\n\nshould stay verbatim.\n'); + writeFileSync( + join(sourceDir, 'change.md'), + '# Change\n\noriginal body about postgres.\n', + ); + writeFileSync(join(sourceDir, 'remove.md'), '# Remove\n\nwill be deleted later.\n'); + + const embedder = makeHashEmbedder(32); + const db = new 
NotesDB(dbPath, { dim: embedder.dim });
+      try {
+        const first = await ingest({ db, root: sourceDir, embedder });
+        assert.equal(first.files, 3);
+
+        writeFileSync(
+          join(sourceDir, 'change.md'),
+          '# Change\n\nrewritten body about distributed systems.\n',
+        );
+        const futureSec = Math.floor(Date.now() / 1000) + 5;
+        utimesSync(join(sourceDir, 'change.md'), futureSec, futureSec);
+        unlinkSync(join(sourceDir, 'remove.md'));
+
+        const second = await refresh({ db, root: sourceDir, embedder });
+        assert.equal(second.files, 1, 'one file changed');
+        assert.equal(second.skipped, 1, 'one file unchanged');
+        assert.equal(second.deleted, 1, 'one file removed');
+
+        const docs = db.listDocuments();
+        assert.equal(docs.size, 2);
+        assert.ok(docs.has('keep.md'));
+        assert.ok(docs.has('change.md'));
+        assert.ok(!docs.has('remove.md'));
+      } finally {
+        db.close();
+      }
+    } finally {
+      rmSync(dir, { recursive: true, force: true });
+    }
+  },
+);
diff --git a/examples/nodejs-notes/test/serve.test.mjs b/examples/nodejs-notes/test/serve.test.mjs
new file mode 100644
index 0000000..2de78b5
--- /dev/null
+++ b/examples/nodejs-notes/test/serve.test.mjs
@@ -0,0 +1,38 @@
+import test from 'node:test';
+import assert from 'node:assert/strict';
+import { writeFileSync, chmodSync, mkdtempSync, rmSync } from 'node:fs';
+import { tmpdir } from 'node:os';
+import { join } from 'node:path';
+
+import { locateMcpBinary } from '../src/serve.mjs';
+
+test('locateMcpBinary honors SQLRITE_MCP_BIN when the file exists', () => {
+  const dir = mkdtempSync(join(tmpdir(), 'sqlrite-mcp-bin-test-'));
+  try {
+    const fakeBin = join(dir, 'fake-mcp');
+    writeFileSync(fakeBin, '#!/bin/sh\necho fake\n');
+    chmodSync(fakeBin, 0o755);
+
+    const prev = process.env.SQLRITE_MCP_BIN;
+    process.env.SQLRITE_MCP_BIN = fakeBin;
+    try {
+      assert.equal(locateMcpBinary(), fakeBin);
+    } finally {
+      if (prev === undefined) delete process.env.SQLRITE_MCP_BIN;
+      else process.env.SQLRITE_MCP_BIN = prev;
+    }
+  } finally {
+    rmSync(dir, { recursive: true, force: true });
+  }
+});
+
+test('locateMcpBinary throws if SQLRITE_MCP_BIN points at a missing file', () => {
+  const prev = process.env.SQLRITE_MCP_BIN;
+  process.env.SQLRITE_MCP_BIN = '/definitely/not/real/sqlrite-mcp';
+  try {
+    assert.throws(() => locateMcpBinary(), /SQLRITE_MCP_BIN/);
+  } finally {
+    if (prev === undefined) delete process.env.SQLRITE_MCP_BIN;
+    else process.env.SQLRITE_MCP_BIN = prev;
+  }
+});
diff --git a/examples/nodejs-notes/test/sqlutil.test.mjs b/examples/nodejs-notes/test/sqlutil.test.mjs
new file mode 100644
index 0000000..7e617d0
--- /dev/null
+++ b/examples/nodejs-notes/test/sqlutil.test.mjs
@@ -0,0 +1,47 @@
+import test from 'node:test';
+import assert from 'node:assert/strict';
+
+import { q, ident } from '../src/sqlutil.mjs';
+
+test('q strings — basic and quote-doubling', () => {
+  assert.equal(q('hello'), "'hello'");
+  assert.equal(q("it's"), "'it''s'");
+  assert.equal(q("a'b'c"), "'a''b''c'");
+  assert.equal(q(''), "''");
+});
+
+test('q numbers — ints, floats, throws on NaN/Inf', () => {
+  assert.equal(q(0), '0');
+  assert.equal(q(42), '42');
+  assert.equal(q(-7), '-7');
+  assert.equal(q(1.5), '1.5');
+  assert.throws(() => q(NaN), TypeError);
+  assert.throws(() => q(Infinity), TypeError);
+});
+
+test('q booleans + null', () => {
+  assert.equal(q(true), 'TRUE');
+  assert.equal(q(false), 'FALSE');
+  assert.equal(q(null), 'NULL');
+  assert.equal(q(undefined), 'NULL');
+});
+
+test('q vector — bracket-array literal', () => {
+  assert.equal(q([0.1, 0.2, 0.3]), '[0.1, 0.2, 0.3]');
+  assert.equal(q([]), '[]');
+  assert.throws(() => q([0.1, 'x']), TypeError);
+  assert.throws(() => q([NaN]), TypeError);
+});
+
+test('q rejects objects', () => {
+  assert.throws(() => q({}), TypeError);
+});
+
+test('ident — accepts only the engine\'s unquoted-identifier subset', () => {
+  assert.equal(ident('users'), 'users');
+  assert.equal(ident('_x9'), '_x9');
+  assert.throws(() => ident('1users'), TypeError);
+  assert.throws(() => ident('users; DROP TABLE x'), TypeError);
+  assert.throws(() => ident('hello world'), TypeError);
+  assert.throws(() => ident(''), TypeError);
+});
diff --git a/web/src/app/examples/page.tsx b/web/src/app/examples/page.tsx
index 1819a13..bec7f24 100644
--- a/web/src/app/examples/page.tsx
+++ b/web/src/app/examples/page.tsx
@@ -48,6 +48,18 @@ const itemListJsonLd = {
           "A CLI chat agent whose long-term memory is a single .sqlrite file. Vector recall via HNSW, lexical recall via BM25, and a structured facts table for deterministic retrieval.",
       },
     },
+    {
+      "@type": "ListItem",
+      position: 2,
+      item: {
+        "@type": "SoftwareSourceCode",
+        name: "Chat with your notes — Node.js + Claude Desktop MCP",
+        url: `${SITE.repo}/tree/main/examples/nodejs-notes`,
+        programmingLanguage: "JavaScript",
+        description:
+          "A Node.js CLI that ingests a folder of markdown notes into SQLRite (HNSW + BM25 indexes), then exposes the database to Claude Desktop via sqlrite-mcp --read-only. Hybrid retrieval over your notes from inside the chat client.",
+      },
+    },
   ],
 };
 
@@ -77,6 +89,22 @@ const EXAMPLES: Example[] = [
     repoPath: "examples/python-agent",
     features: ["HNSW", "VECTOR(384)", "BM25 / FTS", "PyO3 SDK"],
   },
+  {
+    status: "shipped",
+    title: "Chat with your notes — Claude Desktop + MCP",
+    blurb:
+      "A Node.js CLI that ingests a folder of markdown notes into a SQLRite database, then exposes it to Claude Desktop (or any MCP client) via sqlrite-mcp --read-only. Claude calls bm25_search / vector_search / query directly against your local notes — no cloud sync, no custom RAG pipeline.",
+    bullets: [
+      "Markdown → frontmatter-aware chunker → hash or OpenAI embedder → SQLRite documents + chunks tables",
+      "Hybrid retrieval fuses BM25 and vector cosine in a single SQL ORDER BY (see docs/fts.md)",
+      "`sqlrite-notes serve` wraps sqlrite-mcp so the Claude Desktop config snippet is one block of JSON",
+      "Default embedder is fully offline (zero-dep hash bag-of-words); flip to text-embedding-3-small with OPENAI_API_KEY",
+      "40 unit + integration tests; works against the prebuilt @joaoh82/sqlrite npm binaries",
+    ],
+    language: "Node.js 20+",
+    repoPath: "examples/nodejs-notes",
+    features: ["HNSW", "BM25 / FTS", "MCP server", "napi-rs SDK"],
+  },
 ];
 
 const pillStyle: React.CSSProperties = {
@@ -230,11 +258,10 @@ export default function ExamplesIndexPage() {
             fontSize: 14,
           }}
         >
-          More examples in flight: a Node.js MCP-powered notes
-          assistant, a Tauri + Svelte journaling desktop app, a
-          browser SQL playground (WASM), and a Go edge/IoT event
-          collector. See /docs for the
-          engine reference.
+          More examples in flight: a Tauri + Svelte journaling
+          desktop app, a browser SQL playground (WASM), and a Go
+          edge/IoT event collector. See{" "}
+          /docs for the engine reference.