diff --git a/README.md b/README.md
index ba1659e..ce1ad94 100644
--- a/README.md
+++ b/README.md
@@ -108,6 +108,15 @@ Wire it into Claude Code (`~/.claude.json`):
`--read-only` opens the DB with a shared lock and hides the `execute` tool. Full docs + the other six tools' references in [`docs/mcp.md`](docs/mcp.md).
+### End-to-end example apps
+
+Beyond the per-language quickstarts in [`examples/`](examples/), the SQLR-38 umbrella tracks longer, opinionated example apps that exercise SQLRite in real-world shapes:
+
+| App | SDK | What it shows |
+|---|---|---|
+| [Python LLM agent with persistent memory](examples/python-agent/) | Python | Vector + lexical recall, fact extraction, summaries — all in one `.sqlrite` file |
+| [Chat-with-your-notes via Claude Desktop MCP](examples/nodejs-notes/) | Node.js | Markdown → hybrid HNSW + BM25 index → `sqlrite-mcp --read-only` → Claude Desktop |
+
### Developer guide
In-depth documentation lives under [`docs/`](docs/). Start at [`docs/_index.md`](docs/_index.md) — it navigates to:
diff --git a/examples/README.md b/examples/README.md
index 040da24..c98ea78 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -23,6 +23,7 @@ Beyond the per-SDK quick-start tours above, the [SQLR-38 umbrella](../docs/roadm
| App | Language / SDK | What it shows | Directory |
|---|---|---|---|
| LLM agent with persistent memory | Python | Vector + lexical recall, fact extraction, summaries — all in one `.sqlrite` file | [`python-agent/`](python-agent/) |
+| Chat with your notes (MCP) | Node.js | Markdown → SQLRite hybrid retrieval, served to Claude Desktop via `sqlrite-mcp --read-only` | [`nodejs-notes/`](nodejs-notes/) |
## Running the Rust quickstart
@@ -82,6 +83,16 @@ python -m sqlrite_agent # works offline; no API key required
A full CLI chat agent whose long-term memory is one `.sqlrite` file. Embeds each turn, hybrid-searches over past messages and a structured `facts` table on every recall, and survives process restarts. Read [`python-agent/README.md`](python-agent/README.md) for the demo script and architecture diagram.
+## Running the Node.js notes assistant (SQLR-40)
+
+```bash
+cd examples/nodejs-notes
+npm install
+node bin/sqlrite-notes.mjs init ~/Documents/notes
+```
+
+Ingests a folder of markdown notes into a `notes.sqlrite` file with HNSW + BM25 indexes, then `sqlrite-notes serve` wraps `sqlrite-mcp --read-only` so **Claude Desktop / any MCP client** can `bm25_search` / `vector_search` / `query` / `ask` your local notes directly — no cloud sync, no third-party indexer. Default embedder is fully offline (deterministic hash bag-of-words); flip to `--embedder openai` with `OPENAI_API_KEY` set for real semantic recall. Read [`nodejs-notes/README.md`](nodejs-notes/README.md) for the Claude Desktop config snippet and the hybrid-retrieval SQL walkthrough.
+
## Running the Node.js sample
```bash
diff --git a/examples/nodejs-notes/.gitignore b/examples/nodejs-notes/.gitignore
new file mode 100644
index 0000000..5577b65
--- /dev/null
+++ b/examples/nodejs-notes/.gitignore
@@ -0,0 +1,6 @@
+node_modules/
+*.sqlrite
+*.sqlrite-journal
+.env
+.env.local
+test-fixtures-tmp/
diff --git a/examples/nodejs-notes/README.md b/examples/nodejs-notes/README.md
new file mode 100644
index 0000000..cd814a6
--- /dev/null
+++ b/examples/nodejs-notes/README.md
@@ -0,0 +1,344 @@
+# sqlrite-notes — chat with your markdown notes via Claude Desktop
+
+A Node.js CLI that ingests a folder of markdown notes (Obsidian
+vault, Notion export, plain `~/Documents/notes`) into a SQLRite
+database, then exposes the database to **Claude Desktop / any MCP
+client** through the engine's first-party MCP server.
+
+End-user effect: drop your notes folder in, paste one block into
+Claude Desktop's config, and ask Claude *"what did I write about
+CRDTs last month?"* It answers using your local notes. No cloud
+sync and no third-party indexer: the entire memory is one `.sqlrite`
+file on disk that you can open in the REPL.
+
+> **Why this example?** Other "chat with your notes" demos build a
+> custom RAG pipeline and bolt it onto a model. This one shows that
+> when the database itself speaks the agent protocol, you don't need
+> a pipeline — *Claude drives the database directly* via
+> `sqlrite-mcp`. The Node.js side is just the ingest + glue.
+
+## Architecture
+
+```mermaid
+flowchart LR
+    Notes[/"~/notes/*.md<br/>(markdown)"/] -->|sqlrite-notes init / refresh| Ingest
+    Ingest["Ingest pipeline<br/>(walk → chunk → embed → store)"] --> DB[("notes.sqlrite<br/>documents · chunks<br/>HNSW + FTS indexes")]
+    DB -->|"sqlrite-mcp --read-only<br/>stdio JSON-RPC"| Claude["Claude Desktop<br/>(or any MCP client)"]
+ Claude -->|"vector_search · bm25_search · query · ask"| DB
+```
+
+The whole stack: Node.js for the **write side** (ingest pipeline,
+chunking, embeddings), SQLRite for **storage + retrieval primitives**
+(HNSW vector index, BM25 inverted index, raw SQL), and `sqlrite-mcp`
+for the **read side** that Claude actually talks to. The Node CLI
+never touches the database while Claude is connected — that's what
+`--read-only` is for.
+
+## Schema (v1)
+
+| Table | Purpose | Indexes |
+|-------------|------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|
+| `documents` | One row per `.md` file — path, title, mtime, full body, content hash. | UNIQUE on `path`. FTS on `content` (BM25 over whole docs). |
+| `chunks` | One row per ~400-token slice of a document, plus a `VECTOR(384)` embedding. | HNSW on `embedding` (semantic KNN). FTS on `content` (passage BM25). |
+
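Reviewer sketch (not part of the PR): the "~400-token slice" in the table can be pictured as a sliding window with overlap. The defaults below mirror `DEFAULT_CHUNK_TOKENS = 400` and `DEFAULT_CHUNK_OVERLAP = 60` from `src/config.mjs`; the function name `chunkTokens` and the pre-tokenized input are assumptions made for illustration, and the real chunker in `src/ingest.mjs` may split differently (for example on headings or paragraphs).

```javascript
// Hypothetical helper: split an array of tokens into overlapping windows.
// Each window holds at most `targetTokens` items and shares
// `overlapTokens` items with the window before it.
function chunkTokens(tokens, targetTokens = 400, overlapTokens = 60) {
  if (!Number.isInteger(targetTokens) || targetTokens <= 0) {
    throw new RangeError('targetTokens must be a positive integer');
  }
  if (!Number.isInteger(overlapTokens) || overlapTokens < 0 || overlapTokens >= targetTokens) {
    throw new RangeError('overlapTokens must be in [0, targetTokens)');
  }
  const step = targetTokens - overlapTokens;
  const chunks = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + targetTokens));
    // Stop once a window has reached the end of the input, so the
    // tail is not emitted a second time as a tiny extra chunk.
    if (start + targetTokens >= tokens.length) break;
  }
  return chunks;
}
```

With the defaults, consecutive chunks start 340 tokens apart, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.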
+Hybrid retrieval queries `chunks` and fuses BM25 + vector cosine in a
+single `ORDER BY` (see [`docs/fts.md`](../../docs/fts.md) for the
+SQL pattern; the executor's `try_fts_probe` hook serves the top-k
+straight from the inverted index).
+
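Reviewer sketch (not part of the PR): the fusion arithmetic behind the single `ORDER BY`, written as plain JavaScript. It mirrors the expression `w * bm25_score + (1 - w) * (1 - vec_distance_cosine)` that `hybridSearch` builds in `src/db.mjs`. The names `fusedScore` and `rankChunks`, and the candidate shape `{ id, bm25, distance }`, are hypothetical; in the example the engine computes the ranking, not JavaScript.

```javascript
// Clamp the BM25-vs-vector weight into [0, 1]; non-finite input falls back to 0.5.
function clamp01(x) {
  if (!Number.isFinite(x)) return 0.5;
  return Math.min(1, Math.max(0, x));
}

// Weighted sum of a BM25 score and a cosine similarity (1 - cosine distance).
// weight = 1 ranks purely by BM25, weight = 0 purely by vector similarity.
function fusedScore(bm25, cosineDistance, weight = 0.5) {
  const w = clamp01(weight);
  return w * bm25 + (1 - w) * (1 - cosineDistance);
}

// Sort candidates by fused score, best first, and keep the top k.
function rankChunks(candidates, weight = 0.5, k = 5) {
  return [...candidates]
    .sort((a, b) => fusedScore(b.bm25, b.distance, weight) - fusedScore(a.bm25, a.distance, weight))
    .slice(0, k);
}
```

Note that BM25 scores are unbounded while cosine similarity is not, so at `weight = 0.5` the lexical term can dominate; that is a property of the formula as written, not something this sketch corrects.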
+## Install
+
+The example lives inside the SQLRite monorepo for now (the umbrella
+ticket SQLR-38 will lift it into its own repo once a few more
+example apps have shipped).
+
+```bash
+git clone https://github.com/joaoh82/rust_sqlite
+cd rust_sqlite/examples/nodejs-notes
+npm install
+```
+
+`npm install` pulls **`@joaoh82/sqlrite`** (declared as `^0.10.0`) with
+prebuilt napi-rs binaries for macOS-arm64, Linux x64/arm64, and
+Windows x64, so no Rust toolchain is required for the Node side.
+
+`sqlrite-mcp` is a separate Rust binary. Install it once, anywhere
+on your `PATH`:
+
+```bash
+# from crates.io (~30s):
+cargo install sqlrite-mcp
+
+# or grab a prebuilt binary from GitHub Releases:
+# https://github.com/joaoh82/rust_sqlite/releases
+```
+
+If you don't want to install globally, set `SQLRITE_MCP_BIN` to its
+absolute path — `sqlrite-notes serve` will pick it up.
+
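Reviewer sketch (not part of the PR): one way the `SQLRITE_MCP_BIN` override could be resolved before spawning the server. `src/serve.mjs` is not shown in this excerpt, so `resolveMcpBin` is a hypothetical name and the fallback to a bare `sqlrite-mcp` looked up on `PATH` is an assumption based on the install instructions above.

```javascript
// Hypothetical helper: prefer an explicit path from the environment,
// otherwise fall back to the bare command name and let PATH lookup find it.
function resolveMcpBin(env = process.env) {
  const explicit = env.SQLRITE_MCP_BIN;
  if (typeof explicit === 'string' && explicit.trim() !== '') {
    return explicit.trim();
  }
  return 'sqlrite-mcp';
}
```

Taking `env` as a parameter keeps the lookup testable without mutating `process.env`.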
+## Run
+
+```bash
+# 1. Ingest a folder of markdown into a notes.sqlrite database.
+node bin/sqlrite-notes.mjs init ~/Documents/notes
+
+# 2. Confirm it works locally — same retrieval shape Claude will see.
+node bin/sqlrite-notes.mjs search "what did I learn about CRDTs?"
+
+# 3. Wire up Claude Desktop using the snippet printed by `init`
+# (also available any time via `sqlrite-notes config`).
+
+# 4. Open Claude Desktop. The sqlrite-mcp tools appear in the
+# tool picker — `bm25_search`, `vector_search`, `query`, `ask`,
+# plus `list_tables` / `describe_table` / `schema_dump`.
+```
+
+Once you've added the snippet to `claude_desktop_config.json` and
+restarted Claude Desktop, run a chat like:
+
+> *"Summarize what I've written about Postgres over the last month."*
+
+Claude will call `bm25_search` (and/or `vector_search`) against the
+`chunks` table, get back the matching passages, and answer with
+inline citations to the file path and chunk number.
+
+## Zero-config: works fully offline
+
+The default embedder is a **deterministic hash-based bag-of-words**
+embedder that runs in pure JavaScript. No API key, no network,
+nothing to install — `sqlrite-notes init ~/Documents/notes` works
+on a fresh laptop.
+
+Hybrid retrieval still tends to beat either signal alone: BM25
+handles exact-term ranking, and the hash embedder contributes mainly
+through the long tail of co-occurring tokens.
+
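Reviewer sketch (not part of the PR): a standalone version of the hash-based bag-of-words embedder, following the approach in `src/embeddings.mjs` (FNV-1a token hash into one of `dim` slots, then L2 normalization). The names `hashEmbed` and `cosine` are chosen here for illustration, and unlike the PR's `embed` this version is synchronous.

```javascript
// FNV-1a 32-bit hash of a string; returns a non-negative integer.
function fnv1a32(s) {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Count each lowercase alphanumeric token into a hashed slot,
// then scale the vector to unit length (all-zero input stays all-zero).
function hashEmbed(text, dim = 384) {
  const vec = new Array(dim).fill(0);
  const tokens = (text || '').toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const tok of tokens) vec[fnv1a32(tok) % dim] += 1;
  const norm = Math.sqrt(vec.reduce((acc, x) => acc + x * x, 0));
  return norm === 0 ? vec : vec.map((x) => x / norm);
}

// Cosine similarity of two equal-length vectors; for unit vectors this is the dot product.
function cosine(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}
```

Because the vector only counts tokens, word order and synonyms are invisible to it: two texts score as similar only when they share tokens (or collide in a slot), which is why the README treats it as a demo-grade signal next to BM25.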
+For real semantic recall, switch to OpenAI:
+
+```bash
+export OPENAI_API_KEY=sk-...
+node bin/sqlrite-notes.mjs init ~/Documents/notes --embedder openai
+```
+
+Uses `text-embedding-3-small` with the `dimensions: 384` override
+so it matches the schema. Override the model with
+`--openai-model text-embedding-3-large` (and bump `--dim` if you
+want full-fat dimensionality).
+
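Reviewer sketch (not part of the PR): the request the OpenAI embedder sends, reduced to a pure function so the `dimensions` override is easy to see. The endpoint, headers, and body fields follow `makeOpenAIEmbedder` in `src/embeddings.mjs`; `buildEmbeddingRequest` is a hypothetical name, and nothing here performs a network call.

```javascript
// Hypothetical helper: build the URL and fetch() init object for one
// embeddings request. `dimensions` asks the API to return vectors of
// length `dim`, which must match the VECTOR(dim) column in the schema.
function buildEmbeddingRequest({ apiKey, text, model = 'text-embedding-3-small', dim = 384 }) {
  if (!apiKey) throw new Error('apiKey is required');
  return {
    url: 'https://api.openai.com/v1/embeddings',
    init: {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, input: text, dimensions: dim }),
    },
  };
}
```

A caller would pass the result to `fetch(url, init)` and then check that the returned vector length equals `dim`, as the PR's embedder does before storing anything.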
+## CLI surface
+
+```
+sqlrite-notes init
" — debug retrieval the way Claude would over MCP.
+// serve — spawn sqlrite-mcp --read-only against the DB.
+// stats — quick row counts.
+// config — print the Claude Desktop wiring snippet.
+//
+// Flag parsing uses node:util's parseArgs so we have no external
+// dep just for argv handling. Each subcommand owns its own option
+// schema. Unknown / missing args print usage.
+
+import { parseArgs } from 'node:util';
+
+import { NotesDB } from './db.mjs';
+import { ingest, refresh } from './ingest.mjs';
+import { makeEmbedder } from './embeddings.mjs';
+import { search, renderResults } from './search.mjs';
+import { spawnMcpServer } from './serve.mjs';
+import { renderInstructions } from './claude-config.mjs';
+import {
+  resolveDbPath,
+  resolveDir,
+  defaultDbPath,
+  DEFAULT_EMBEDDING_DIM,
+  DEFAULT_CHUNK_TOKENS,
+  DEFAULT_CHUNK_OVERLAP,
+} from './config.mjs';
+
+const VERSION = '0.1.0';
+
+const USAGE = `sqlrite-notes ${VERSION} — chat with your markdown notes via Claude Desktop + SQLRite MCP.
+
+Usage:
+  sqlrite-notes[options]
+
+Commands:
+  init Build (or rebuild) the notes index from .
+  refresh Incremental re-ingest based on file mtime/hash.
+  search " " Run hybrid retrieval against the index (debug).
+  serve Spawn sqlrite-mcp --read-only against the DB.
+  stats Print row counts.
+  config Print the Claude Desktop config snippet.
+  help Show this message.
+
+Common options:
+  --db Path to the SQLRite database file.
+    Default: ${defaultDbPath()}
+  --embedder hash|openai Embedding provider (default: hash, offline).
+  --dim Vector dimension (default: ${DEFAULT_EMBEDDING_DIM}).
+  --openai-model OpenAI embedding model (default: text-embedding-3-small).
+
+Init / refresh options:
+  --chunk-tokens Target chunk size in tokens (default: ${DEFAULT_CHUNK_TOKENS}).
+  --chunk-overlap Chunk overlap in tokens (default: ${DEFAULT_CHUNK_OVERLAP}).
+
+Search options:
+  -k Number of results to return (default: 5).
+  -w <0..1> BM25 vs vector weight (default: 0.5).
+
+Environment:
+  OPENAI_API_KEY              Required when --embedder openai.
+  SQLRITE_NOTES_EMBEDDER      Default embedder (hash | openai).
+  SQLRITE_NOTES_OPENAI_MODEL  Override OpenAI model id.
+  SQLRITE_MCP_BIN             Explicit path to sqlrite-mcp for 'serve'.
+`;
+
+/**
+ * Entry point. Returns the process exit code (0 = OK).
+ *
+ * @param {string[]} argv arguments after `node bin/sqlrite-notes.mjs`
+ */
+export async function run(argv) {
+  const [command, ...rest] = argv;
+  if (!command || command === 'help' || command === '--help' || command === '-h') {
+    process.stdout.write(USAGE);
+    return 0;
+  }
+  if (command === 'version' || command === '--version' || command === '-V') {
+    process.stdout.write(`sqlrite-notes ${VERSION}\n`);
+    return 0;
+  }
+
+  switch (command) {
+    case 'init':
+      return cmdInit(rest);
+    case 'refresh':
+      return cmdRefresh(rest);
+    case 'search':
+      return cmdSearch(rest);
+    case 'serve':
+      return cmdServe(rest);
+    case 'stats':
+      return cmdStats(rest);
+    case 'config':
+      return cmdConfig(rest);
+    default:
+      process.stderr.write(`unknown command: ${command}\n\n`);
+      process.stderr.write(USAGE);
+      return 2;
+  }
+}
+
+// ------------------------------------------------------------------
+// init
+
+async function cmdInit(argv) {
+  const { values, positionals } = parseArgs({
+    args: argv,
+    allowPositionals: true,
+    options: {
+      db: { type: 'string' },
+      embedder: { type: 'string' },
+      dim: { type: 'string' },
+      'openai-model': { type: 'string' },
+      'chunk-tokens': { type: 'string' },
+      'chunk-overlap': { type: 'string' },
+    },
+  });
+  if (positionals.length === 0) {
+    process.stderr.write('init: missing \n\nusage: sqlrite-notes init [--db path] [--embedder hash|openai]\n');
+    return 2;
+  }
+  const root = resolveDir(positionals[0]);
+  const dbPath = resolveDbPath(values.db);
+  const dim = parseDim(values.dim);
+  const embedder = makeEmbedder({
+    kind: values.embedder,
+    dim,
+    model: values['openai-model'],
+  });
+  const db = new NotesDB(dbPath, { dim: embedder.dim });
+
+  try {
+    process.stdout.write(`sqlrite-notes ${VERSION}\n`);
+    process.stdout.write(` db: ${dbPath}\n`);
+    process.stdout.write(` source: ${root}\n`);
+    process.stdout.write(` embedder: ${embedder.name} (dim=${embedder.dim})\n`);
+
+    const stats = await ingest({
+      db,
+      root,
+      embedder,
+      logger: (s) => process.stdout.write(`${s}\n`),
+      chunkOpts: parseChunkOpts(values),
+    });
+    process.stdout.write(`\ningested ${stats.files} file(s), ${stats.chunks} chunk(s) in ${stats.elapsedMs} ms\n`);
+    process.stdout.write('\n');
+    process.stdout.write(renderInstructions({ dbPath }));
+    process.stdout.write('\n');
+    return 0;
+  } finally {
+    db.close();
+  }
+}
+
+// ------------------------------------------------------------------
+// refresh
+
+async function cmdRefresh(argv) {
+  const { values, positionals } = parseArgs({
+    args: argv,
+    allowPositionals: true,
+    options: {
+      db: { type: 'string' },
+      embedder: { type: 'string' },
+      dim: { type: 'string' },
+      'openai-model': { type: 'string' },
+      'chunk-tokens': { type: 'string' },
+      'chunk-overlap': { type: 'string' },
+      source: { type: 'string' },
+    },
+  });
+  // is optional for refresh — if omitted, we re-ingest the same
+  // tree that init recorded. (For now we just require it; we don't
+  // store the source dir in the DB. Documented in the README.)
+  const rootArg = values.source ?? positionals[0];
+  if (!rootArg) {
+    process.stderr.write(
+      'refresh: pass the source directory as a positional (or --source ).\n' +
+        'We don\'t yet persist the source path inside the DB — see the README\n' +
+        '"Known simplifications" section.\n',
+    );
+    return 2;
+  }
+  const root = resolveDir(rootArg);
+  const dbPath = resolveDbPath(values.db);
+  const dim = parseDim(values.dim);
+  const embedder = makeEmbedder({
+    kind: values.embedder,
+    dim,
+    model: values['openai-model'],
+  });
+  const db = new NotesDB(dbPath, { dim: embedder.dim });
+  try {
+    const stats = await refresh({
+      db,
+      root,
+      embedder,
+      logger: (s) => process.stdout.write(`${s}\n`),
+      chunkOpts: parseChunkOpts(values),
+    });
+    process.stdout.write(
+      `refreshed: ${stats.files} updated, ${stats.skipped} unchanged, ${stats.deleted} deleted (${stats.elapsedMs} ms)\n`,
+    );
+    return 0;
+  } finally {
+    db.close();
+  }
+}
+
+// ------------------------------------------------------------------
+// search
+
+async function cmdSearch(argv) {
+  const { values, positionals } = parseArgs({
+    args: argv,
+    allowPositionals: true,
+    options: {
+      db: { type: 'string' },
+      embedder: { type: 'string' },
+      dim: { type: 'string' },
+      'openai-model': { type: 'string' },
+      k: { type: 'string', short: 'k' },
+      w: { type: 'string', short: 'w' },
+    },
+  });
+  const query = positionals.join(' ').trim();
+  if (!query) {
+    process.stderr.write('search: missing query string.\n\nusage: sqlrite-notes search " " [-k N] [-w 0..1]\n');
+    return 2;
+  }
+  const dbPath = resolveDbPath(values.db);
+  const dim = parseDim(values.dim);
+  const embedder = makeEmbedder({
+    kind: values.embedder,
+    dim,
+    model: values['openai-model'],
+  });
+  const db = new NotesDB(dbPath, { dim: embedder.dim, readOnly: true });
+  try {
+    const hits = await search({
+      db,
+      embedder,
+      query,
+      k: parseInt2(values.k, 5),
+      weight: parseFloat2(values.w, 0.5),
+    });
+    process.stdout.write(renderResults(query, hits));
+    return 0;
+  } finally {
+    db.close();
+  }
+}
+
+// ------------------------------------------------------------------
+// serve
+
+async function cmdServe(argv) {
+  const { values } = parseArgs({
+    args: argv,
+    options: {
+      db: { type: 'string' },
+    },
+  });
+  const dbPath = resolveDbPath(values.db);
+  // sqlrite-mcp opens its own database, so we don't touch it here —
+  // just pass the resolved path through.
+  const code = await spawnMcpServer({ dbPath });
+  return code;
+}
+
+// ------------------------------------------------------------------
+// stats
+
+async function cmdStats(argv) {
+  const { values } = parseArgs({
+    args: argv,
+    options: {
+      db: { type: 'string' },
+    },
+  });
+  const dbPath = resolveDbPath(values.db);
+  const db = new NotesDB(dbPath, { readOnly: true });
+  try {
+    const s = db.stats();
+    process.stdout.write(`db: ${dbPath}\n`);
+    process.stdout.write(`documents: ${s.documents}\n`);
+    process.stdout.write(`chunks: ${s.chunks}\n`);
+    process.stdout.write(`embedding dim: ${s.dim}\n`);
+    return 0;
+  } finally {
+    db.close();
+  }
+}
+
+// ------------------------------------------------------------------
+// config
+
+async function cmdConfig(argv) {
+  const { values } = parseArgs({
+    args: argv,
+    options: {
+      db: { type: 'string' },
+      bin: { type: 'string' },
+    },
+  });
+  const dbPath = resolveDbPath(values.db);
+  process.stdout.write(renderInstructions({ dbPath, binPath: values.bin }));
+  process.stdout.write('\n');
+  return 0;
+}
+
+// ------------------------------------------------------------------
+// shared option parsing
+
+function parseDim(raw) {
+  if (raw === undefined) return DEFAULT_EMBEDDING_DIM;
+  const n = parseInt(raw, 10);
+  if (!Number.isFinite(n) || n <= 0) {
+    throw new Error(`--dim: invalid value ${JSON.stringify(raw)}`);
+  }
+  return n;
+}
+
+function parseChunkOpts(values) {
+  return {
+    targetTokens: parseInt2(values['chunk-tokens'], DEFAULT_CHUNK_TOKENS),
+    overlapTokens: parseInt2(values['chunk-overlap'], DEFAULT_CHUNK_OVERLAP),
+  };
+}
+
+function parseInt2(raw, fallback) {
+  if (raw === undefined) return fallback;
+  const n = parseInt(raw, 10);
+  if (!Number.isFinite(n) || n < 0) {
+    throw new Error(`invalid integer: ${JSON.stringify(raw)}`);
+  }
+  return n;
+}
+
+function parseFloat2(raw, fallback) {
+  if (raw === undefined) return fallback;
+  const n = parseFloat(raw);
+  if (!Number.isFinite(n)) {
+    throw new Error(`invalid number: ${JSON.stringify(raw)}`);
+  }
+  return n;
+}
diff --git a/examples/nodejs-notes/src/config.mjs b/examples/nodejs-notes/src/config.mjs
new file mode 100644
index 0000000..1984be1
--- /dev/null
+++ b/examples/nodejs-notes/src/config.mjs
@@ -0,0 +1,71 @@
+// Defaults + small helpers around config paths and the database
+// location. Everything is overridable via flags on the CLI; this
+// module just picks reasonable fallbacks.
+
+import { homedir, platform } from 'node:os';
+import { resolve, join } from 'node:path';
+
+export const DEFAULT_EMBEDDING_DIM = 384;
+export const DEFAULT_CHUNK_TOKENS = 400;
+export const DEFAULT_CHUNK_OVERLAP = 60;
+
+/** Resolve the default DB path: ~/.sqlrite-notes/notes.sqlrite */
+export function defaultDbPath() {
+  return join(homedir(), '.sqlrite-notes', 'notes.sqlrite');
+}
+
+/**
+ * Resolve a user-supplied directory path. Expands `~` and resolves
+ * relative paths against the current working directory.
+ *
+ * @param {string} input
+ * @returns {string}
+ */
+export function resolveDir(input) {
+  if (!input) throw new Error('resolveDir(): empty path');
+  let expanded = input;
+  if (expanded === '~' || expanded.startsWith('~/')) {
+    expanded = join(homedir(), expanded.slice(1));
+  }
+  return resolve(expanded);
+}
+
+/**
+ * Resolve a user-supplied DB path. Same expansion rules as
+ * `resolveDir` but doesn't require the parent directory to exist —
+ * the caller (db.mjs) will mkdir as needed.
+ *
+ * @param {string | undefined} input
+ * @returns {string}
+ */
+export function resolveDbPath(input) {
+  return resolveDir(input ?? defaultDbPath());
+}
+
+/**
+ * Best-guess location of Claude Desktop's config file.
+ * Used only for the `init`'s "wire me up" hint — we never read or
+ * write the file from the CLI.
+ *
+ * @returns {string}
+ */
+export function claudeDesktopConfigPath() {
+  if (platform() === 'darwin') {
+    return join(
+      homedir(),
+      'Library',
+      'Application Support',
+      'Claude',
+      'claude_desktop_config.json',
+    );
+  }
+  if (platform() === 'win32') {
+    const appData =
+      process.env.APPDATA ?? join(homedir(), 'AppData', 'Roaming');
+    return join(appData, 'Claude', 'claude_desktop_config.json');
+  }
+  // Linux — Claude Desktop's Linux build is in beta; this is the
+  // documented path. Falls back to ~/.config if XDG_CONFIG_HOME unset.
+  const xdg = process.env.XDG_CONFIG_HOME ?? join(homedir(), '.config');
+  return join(xdg, 'Claude', 'claude_desktop_config.json');
+}
diff --git a/examples/nodejs-notes/src/db.mjs b/examples/nodejs-notes/src/db.mjs
new file mode 100644
index 0000000..f57ce5e
--- /dev/null
+++ b/examples/nodejs-notes/src/db.mjs
@@ -0,0 +1,400 @@
+// SQLRite-backed storage for the notes index.
+//
+// Owns the schema, migrations, and every SQL string in the project.
+// Higher-level modules (ingest.mjs, search.mjs) call into `NotesDB`
+// rather than touching SQL directly.
+//
+// Schema v1 — two tables:
+//
+//   documents(id, path, title, mtime, content, content_hash)
+//     FTS index on `content`.
+//
+//   chunks(id, document_id, ord, content, embedding VECTOR(dim))
+//     HNSW index on `embedding`, FTS index on `content`.
+//
+// One row per file in `documents`; one row per ~400-token slice in
+// `chunks`. Hybrid retrieval queries `chunks` (vector + BM25, fused
+// at the SQL level) and joins back to `documents` for path / title.
+
+import { mkdirSync } from 'node:fs';
+import { dirname } from 'node:path';
+
+import { Database } from '@joaoh82/sqlrite';
+
+import { q, ident } from './sqlutil.mjs';
+import { DEFAULT_EMBEDDING_DIM } from './config.mjs';
+
+const SCHEMA_VERSION = 1;
+
+export class NotesDB {
+  /**
+   * Open or create a notes database at `path`. Pass `:memory:` for a
+   * transient store (useful in tests).
+   *
+   * @param {string} path
+   * @param {{ dim?: number, readOnly?: boolean }} [opts]
+   */
+  constructor(path, opts = {}) {
+    this.path = path;
+    this.dim = opts.dim ?? DEFAULT_EMBEDDING_DIM;
+
+    if (path !== ':memory:') {
+      mkdirSync(dirname(path), { recursive: true });
+    }
+
+    this._db = opts.readOnly
+      ? Database.openReadOnly(path)
+      : new Database(path);
+
+    if (!opts.readOnly) {
+      this._migrate();
+    }
+  }
+
+  // ------------------------------------------------------------------
+  // Migrations
+
+  _migrate() {
+    const cur = this._db;
+    let current = 0;
+    try {
+      const row = cur.prepare('SELECT version FROM schema_version').get();
+      current = row?.version ?? 0;
+    } catch {
+      // schema_version table doesn't exist yet — fresh database.
+      cur.exec(
+        'CREATE TABLE schema_version (version INTEGER PRIMARY KEY)',
+      );
+      cur.exec(
+        `INSERT INTO schema_version (version) VALUES (${q(0)})`,
+      );
+    }
+
+    if (current < 1) {
+      this._applyV1();
+      cur.exec(`DELETE FROM schema_version`);
+      cur.exec(
+        `INSERT INTO schema_version (version) VALUES (${q(SCHEMA_VERSION)})`,
+      );
+    }
+  }
+
+  _applyV1() {
+    const dim = this.dim;
+    this._db.exec(`
+      CREATE TABLE documents (
+        id INTEGER PRIMARY KEY,
+        path TEXT NOT NULL UNIQUE,
+        title TEXT NOT NULL,
+        mtime INTEGER NOT NULL,
+        content TEXT NOT NULL,
+        content_hash TEXT NOT NULL
+      )
+    `);
+    this._db.exec(`
+      CREATE TABLE chunks (
+        id INTEGER PRIMARY KEY,
+        document_id INTEGER NOT NULL,
+        ord INTEGER NOT NULL,
+        content TEXT NOT NULL,
+        embedding VECTOR(${dim})
+      )
+    `);
+    // FTS indexes give us BM25 ranking via `bm25_score(col, 'q')` —
+    // both documents.content (whole-document hits) and chunks.content
+    // (passage-level hits) are useful surfaces.
+    this._db.exec('CREATE INDEX idx_documents_fts ON documents USING fts (content)');
+    this._db.exec('CREATE INDEX idx_chunks_fts ON chunks USING fts (content)');
+    // HNSW for semantic KNN over chunk embeddings.
+    this._db.exec('CREATE INDEX idx_chunks_emb ON chunks USING hnsw (embedding)');
+  }
+
+  // ------------------------------------------------------------------
+  // Writes
+
+  /**
+   * Upsert a document by `path`. Returns `{ id, replaced }` — `replaced`
+   * is true if a previous version of the document was removed first.
+   *
+   * Chunks are NOT touched here; the caller is responsible for calling
+   * `replaceChunks(id, ...)` after re-chunking + re-embedding.
+   *
+   * @param {{ path: string, title: string, mtime: number, content: string, contentHash: string }} doc
+   * @returns {{ id: number, replaced: boolean }}
+   */
+  upsertDocument(doc) {
+    const existing = this._db
+      .prepare(`SELECT id FROM documents WHERE path = ${q(doc.path)}`)
+      .get();
+    let replaced = false;
+
+    if (existing) {
+      replaced = true;
+      // Delete existing chunks first — referential consistency.
+      this._db.exec(`DELETE FROM chunks WHERE document_id = ${q(existing.id)}`);
+      this._db.exec(`DELETE FROM documents WHERE id = ${q(existing.id)}`);
+    }
+
+    this._db.exec(
+      `INSERT INTO documents (path, title, mtime, content, content_hash) VALUES (` +
+        `${q(doc.path)}, ${q(doc.title)}, ${q(doc.mtime)}, ${q(doc.content)}, ${q(doc.contentHash)})`,
+    );
+    const inserted = this._db
+      .prepare(`SELECT id FROM documents WHERE path = ${q(doc.path)}`)
+      .get();
+    if (!inserted) throw new Error('upsertDocument(): row vanished after INSERT');
+    return { id: inserted.id, replaced };
+  }
+
+  /**
+   * Insert one chunk row. Embedding must match `this.dim`.
+   *
+   * @param {{ documentId: number, ord: number, content: string, embedding: number[] }} chunk
+   */
+  insertChunk({ documentId, ord, content, embedding }) {
+    if (embedding.length !== this.dim) {
+      throw new Error(
+        `insertChunk(): embedding dim ${embedding.length} ≠ schema dim ${this.dim}`,
+      );
+    }
+    this._db.exec(
+      `INSERT INTO chunks (document_id, ord, content, embedding) VALUES (` +
+        `${q(documentId)}, ${q(ord)}, ${q(content)}, ${q(embedding)})`,
+    );
+  }
+
+  /**
+   * Drop a document and every chunk pointing at it.
+   *
+   * @param {number} documentId
+   */
+  deleteDocument(documentId) {
+    this._db.exec(`DELETE FROM chunks WHERE document_id = ${q(documentId)}`);
+    this._db.exec(`DELETE FROM documents WHERE id = ${q(documentId)}`);
+  }
+
+  // ------------------------------------------------------------------
+  // Reads
+
+  /**
+   * Map of path → { id, mtime, content_hash }. Used by `refresh` to
+   * decide which files changed.
+   *
+   * @returns {Map }
+   */
+  listDocuments() {
+    const rows = this._db
+      .prepare('SELECT id, path, mtime, content_hash FROM documents')
+      .all();
+    const map = new Map();
+    for (const r of rows) {
+      map.set(r.path, {
+        id: r.id,
+        mtime: r.mtime,
+        contentHash: r.content_hash,
+      });
+    }
+    return map;
+  }
+
+  /**
+   * Hybrid top-k search over chunks. Combines BM25 lexical with vector
+   * cosine in a single `ORDER BY` (see `docs/fts.md`).
+   *
+   * If `query` produces no FTS tokens (e.g. a single non-ASCII word),
+   * we fall back to vector-only ranking — otherwise the FTS pre-filter
+   * would return an empty set.
+   *
+   * @param {{ query: string, embedding: number[], k?: number, weight?: number }} args
+   * @returns {Array<{ chunk_id: number, document_id: number, path: string, title: string, ord: number, content: string }>}
+   */
+  hybridSearch({ query, embedding, k = 5, weight = 0.5 }) {
+    if (embedding.length !== this.dim) {
+      throw new Error(
+        `hybridSearch(): embedding dim ${embedding.length} ≠ schema dim ${this.dim}`,
+      );
+    }
+    const tokens = ftsTokenize(query);
+    const ftsQuery = tokens.join(' ');
+    const w = clamp01(weight);
+
+    let chunkRows;
+    if (ftsQuery.length === 0) {
+      chunkRows = this._db
+        .prepare(
+          `SELECT id, document_id, ord, content FROM chunks ` +
+            `ORDER BY vec_distance_cosine(embedding, ${q(embedding)}) ASC ` +
+            `LIMIT ${q(k)}`,
+        )
+        .all();
+    } else {
+      // Hybrid: fts_match pre-filters, ORDER BY fuses BM25 + cosine.
+      chunkRows = this._db
+        .prepare(
+          `SELECT id, document_id, ord, content FROM chunks ` +
+            `WHERE fts_match(content, ${q(ftsQuery)}) ` +
+            `ORDER BY ${q(w)} * bm25_score(content, ${q(ftsQuery)}) ` +
+            `+ ${q(1 - w)} * (1.0 - vec_distance_cosine(embedding, ${q(embedding)})) ` +
+            `DESC LIMIT ${q(k)}`,
+        )
+        .all();
+      // If FTS pre-filter happened to find nothing (every token is
+      // unknown to the index), fall back to vector-only so the agent
+      // always gets *some* recall to ground on.
+      if (chunkRows.length === 0) {
+        chunkRows = this._db
+          .prepare(
+            `SELECT id, document_id, ord, content FROM chunks ` +
+              `ORDER BY vec_distance_cosine(embedding, ${q(embedding)}) ASC ` +
+              `LIMIT ${q(k)}`,
+          )
+          .all();
+      }
+    }
+
+    return chunkRows.map((row) => {
+      const doc = this._db
+        .prepare(
+          `SELECT path, title FROM documents WHERE id = ${q(row.document_id)}`,
+        )
+        .get();
+      return {
+        chunk_id: row.id,
+        document_id: row.document_id,
+        path: doc?.path ?? '',
+        title: doc?.title ?? '',
+        ord: row.ord,
+        content: row.content,
+      };
+    });
+  }
+
+  /**
+   * BM25 top-k over `documents.content` — useful for the debug
+   * `search --mode=bm25-docs` shape.
+   *
+   * @param {string} query
+   * @param {number} k
+   */
+  bm25DocumentsSearch(query, k = 5) {
+    const tokens = ftsTokenize(query);
+    if (tokens.length === 0) return [];
+    const ftsQuery = tokens.join(' ');
+    return this._db
+      .prepare(
+        `SELECT id, path, title FROM documents ` +
+          `WHERE fts_match(content, ${q(ftsQuery)}) ` +
+          `ORDER BY bm25_score(content, ${q(ftsQuery)}) DESC ` +
+          `LIMIT ${q(k)}`,
+      )
+      .all();
+  }
+
+  /** Quick row counts for `stats`. */
+  stats() {
+    const dRow = this._db.prepare('SELECT COUNT(*) AS c FROM documents').get();
+    const cRow = this._db.prepare('SELECT COUNT(*) AS c FROM chunks').get();
+    return {
+      documents: Number(dRow?.c ?? 0),
+      chunks: Number(cRow?.c ?? 0),
+      dim: this.dim,
+    };
+  }
+
+  // ------------------------------------------------------------------
+  // Transactions
+
+  /**
+   * Run `fn` inside a single transaction. Commits on success, rolls
+   * back on any thrown error. Synchronous — the engine is sync.
+   *
+   * @template T
+   * @param {() => T} fn
+   * @returns {T}
+   */
+  transaction(fn) {
+    this._db.exec('BEGIN');
+    try {
+      const result = fn();
+      this._db.exec('COMMIT');
+      return result;
+    } catch (err) {
+      try {
+        this._db.exec('ROLLBACK');
+      } catch {
+        // Ignore — the engine is in an unknown state; surface the
+        // original error to the caller.
+      }
+      throw err;
+    }
+  }
+
+  /** Raw escape hatch — used by tests for ad-hoc SQL. */
+  raw() {
+    return this._db;
+  }
+
+  /**
+   * Close the underlying engine connection and re-open it at the same
+   * path. Used by the ingest pipeline to work around the engine's
+   * HNSW-after-delete bug (see the example's README). After this
+   * call the wrapper still works exactly as before — only the
+   * underlying connection is fresh, which forces a clean index
+   * rebuild on the next read.
+   *
+   * @param {{ readOnly?: boolean }} [opts]
+   */
+  reopen(opts = {}) {
+    if (this.path === ':memory:') {
+      throw new Error('reopen(): in-memory databases cannot be reopened (state would be lost)');
+    }
+    this._db.close();
+    this._db = opts.readOnly
+      ? Database.openReadOnly(this.path)
+      : new Database(this.path);
+  }
+
+  close() {
+    this._db.close();
+  }
+}
+
+// ------------------------------------------------------------------
+// FTS tokenizer mirror.
+//
+// The engine's FTS tokenizer (docs/fts.md) splits on `[^A-Za-z0-9]+`
+// and lowercases. We replicate it in JS so we can pre-check whether a
+// query string would yield any tokens — if not, the FTS WHERE clause
+// matches nothing and we should fall back to vector-only.
+
+const TOKEN_RE = /[A-Za-z0-9]+/g;
+const STOPWORDS = new Set([
+  'a', 'an', 'and', 'or', 'the', 'is', 'are', 'was', 'were', 'be', 'been',
+  'in', 'on', 'at', 'to', 'of', 'for', 'with', 'by', 'as', 'it', 'this',
+  'that', 'these', 'those', 'i', 'you', 'we', 'they', 'he', 'she',
+]);
+
+/**
+ * Tokenize a query the same way the engine's FTS tokenizer would,
+ * then drop a tiny stop-list to avoid `fts_match` ballooning into a
+ * full-table scan on filler words. (The engine has no stop list of
+ * its own — that's intentional, see `docs/fts.md`. But for retrieval
+ * we definitely don't want "the" + "is" to drive ranking.)
+ *
+ * @param {string} text
+ * @returns {string[]}
+ */
+export function ftsTokenize(text) {
+  if (!text) return [];
+  const matches = text.match(TOKEN_RE) ?? [];
+  return matches
+    .map((t) => t.toLowerCase())
+    .filter((t) => t.length > 1 && !STOPWORDS.has(t));
+}
+
+function clamp01(x) {
+  if (!Number.isFinite(x)) return 0.5;
+  if (x < 0) return 0;
+  if (x > 1) return 1;
+  return x;
+}
diff --git a/examples/nodejs-notes/src/embeddings.mjs b/examples/nodejs-notes/src/embeddings.mjs
new file mode 100644
index 0000000..bcdb179
--- /dev/null
+++ b/examples/nodejs-notes/src/embeddings.mjs
@@ -0,0 +1,152 @@
+// Embedding-provider abstractions.
+//
+// Two providers:
+//
+//   1. `hash` (default, offline) — a token-bag hash embedder that
+//      lets users run the whole pipeline without an API key.
+//      Quality is bag-of-words-ish; good for demos and tests, not
+//      for production RAG.
+//
+//   2. `openai` — `text-embedding-3-small`. Pinned to the `dimensions`
+//      override so we stay at 384 dims for compatibility with the
+//      schema (and for parity with the python-agent example).
+//
+// All providers share the same surface: `await embed(text)` returns
+// a `number[]` of `provider.dim` items.
+ +import { DEFAULT_EMBEDDING_DIM } from './config.mjs'; + +/** + * @typedef {object} Embedder + * @property {string} name + * @property {number} dim + * @property {(text: string) => Promise<number[]>} embed + */ + +/** + * Build an embedder by name. Throws if the configuration is invalid + * (e.g. `openai` without `OPENAI_API_KEY`). + * + * @param {{ kind?: string, dim?: number, model?: string, apiKey?: string, fetchFn?: typeof fetch }} opts + * @returns {Embedder} + */ +export function makeEmbedder(opts = {}) { + const kind = opts.kind ?? process.env.SQLRITE_NOTES_EMBEDDER ?? 'hash'; + const dim = opts.dim ?? DEFAULT_EMBEDDING_DIM; + if (kind === 'hash') return makeHashEmbedder(dim); + if (kind === 'openai') { + const apiKey = opts.apiKey ?? process.env.OPENAI_API_KEY; + if (!apiKey) { + throw new Error( + 'openai embedder: set OPENAI_API_KEY (or pass --embedder hash to run offline).', + ); + } + const model = opts.model ?? process.env.SQLRITE_NOTES_OPENAI_MODEL ?? 'text-embedding-3-small'; + return makeOpenAIEmbedder({ + apiKey, + model, + dim, + fetchFn: opts.fetchFn ?? fetch, + }); + } + throw new Error(`unknown embedder kind: ${JSON.stringify(kind)} (expected "hash" or "openai")`); +} + +// ------------------------------------------------------------------ +// Hash embedder +// +// Deterministic, zero-dependency, offline. Maps each lowercased +// alphanumeric token (`[a-z0-9]+`) through a tiny FNV-1a hash into +// one of `dim` slots, counting one per occurrence, then L2-normalizes +// the result so cosine similarity is meaningful. + +/** + * @param {number} dim + * @returns {Embedder} + */ +export function makeHashEmbedder(dim) { + return { + name: 'hash', + dim, + async embed(text) { + const vec = new Float64Array(dim); + const tokens = (text || '').toLowerCase().match(/[a-z0-9]+/g) ?? []; + for (const tok of tokens) { + const slot = fnv1a32(tok) % dim; + vec[slot] += 1; + } + // L2 normalize (zero-safe). 
+ let sumSq = 0; + for (let i = 0; i < vec.length; i++) sumSq += vec[i] * vec[i]; + const norm = Math.sqrt(sumSq); + if (norm === 0) return Array.from(vec); + const out = new Array(dim); + for (let i = 0; i < vec.length; i++) out[i] = vec[i] / norm; + return out; + }, + }; +} + +function fnv1a32(s) { + // Classic FNV-1a 32-bit, returns a non-negative integer. + let h = 0x811c9dc5; + for (let i = 0; i < s.length; i++) { + h ^= s.charCodeAt(i); + h = Math.imul(h, 0x01000193); + } + return h >>> 0; +} + +// ------------------------------------------------------------------ +// OpenAI embedder + +/** + * @param {{ apiKey: string, model: string, dim: number, fetchFn: typeof fetch }} args + * @returns {Embedder} + */ +export function makeOpenAIEmbedder({ apiKey, model, dim, fetchFn }) { + return { + name: `openai/${model}`, + dim, + async embed(text) { + const body = JSON.stringify({ + model, + input: text, + dimensions: dim, + }); + const res = await fetchFn('https://api.openai.com/v1/embeddings', { + method: 'POST', + headers: { + 'content-type': 'application/json', + authorization: `Bearer ${apiKey}`, + }, + body, + }); + if (!res.ok) { + const detail = await safeText(res); + throw new Error( + `OpenAI embeddings API error ${res.status}: ${detail.slice(0, 300)}`, + ); + } + const json = await res.json(); + const vec = json?.data?.[0]?.embedding; + if (!Array.isArray(vec)) { + throw new Error('OpenAI embeddings: malformed response (no data[0].embedding)'); + } + if (vec.length !== dim) { + throw new Error( + `OpenAI embeddings: returned ${vec.length} dims, expected ${dim}`, + ); + } + return vec; + }, + }; +} + +async function safeText(res) { + try { + return await res.text(); + } catch { + return ''; + } +} diff --git a/examples/nodejs-notes/src/ingest.mjs b/examples/nodejs-notes/src/ingest.mjs new file mode 100644 index 0000000..2c4cfcf --- /dev/null +++ b/examples/nodejs-notes/src/ingest.mjs @@ -0,0 +1,249 @@ +// Markdown → SQLRite ingest pipeline. 
+// +// Walks a directory of `.md` / `.markdown` files, chunks each one, +// embeds every chunk, and writes documents + chunks into the DB. +// Two entry points: +// +// - `ingest(...)` — full reindex of a directory. Used by `init`. +// - `refresh(...)` — incremental: skip files whose mtime + content +// hash haven't changed since the last run. Used by `refresh`. +// +// Both flow through `ingestImpl`, which splits the work into three +// phases: PLAN (read-only diff against the current DB) → DELETE (drop +// stale documents/chunks; close + reopen the DB) → INSERT (write new +// rows). The close/reopen between DELETE and INSERT is a workaround +// for an engine bug where the HNSW chunk index panics when rows are +// deleted and re-inserted in the same connection lifetime — see the +// "Known limitations" section of this example's README. + +import { readFile, readdir, stat } from 'node:fs/promises'; +import { createHash } from 'node:crypto'; +import { join, relative, basename, extname } from 'node:path'; + +import { + stripFrontmatter, + deriveTitle, + chunkMarkdown, +} from './chunker.mjs'; +import { DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP } from './config.mjs'; + +/** + * @typedef {object} IngestStats + * @property {number} files + * @property {number} chunks + * @property {number} skipped + * @property {number} deleted + * @property {number} elapsedMs + */ + +/** + * Find every markdown file under `root` (recursive). Ignores hidden + * directories (`.git`, `.obsidian`, etc.) and `node_modules` so a + * dropped-in Obsidian vault doesn't suck in junk. + * + * @param {string} root + * @returns {Promise<string[]>} + */ +export async function findMarkdownFiles(root) { + const out = []; + await walk(root, out); + out.sort(); + return out; +} + +async function walk(dir, out) { + let entries; + try { + entries = await readdir(dir, { withFileTypes: true }); + } catch { + return; // root may not exist; let the caller surface the message. 
+ } + for (const ent of entries) { + const full = join(dir, ent.name); + if (ent.isDirectory()) { + if (ent.name.startsWith('.') || ent.name === 'node_modules') continue; + await walk(full, out); + continue; + } + if (!ent.isFile()) continue; + const ext = extname(ent.name).toLowerCase(); + if (ext === '.md' || ext === '.markdown') out.push(full); + } +} + +/** + * Re-ingest every file under `root` — replaces any existing rows for + * the same path. Use for the `init` flow. + * + * @param {{ db: NotesDB, root: string, embedder: import('./embeddings.mjs').Embedder, logger?: (s: string) => void, chunkOpts?: { targetTokens?: number, overlapTokens?: number } }} args + * @returns {Promise<IngestStats>} + */ +export async function ingest(args) { + return ingestImpl({ ...args, mode: 'full' }); +} + +/** + * Incremental re-ingest. Skips files whose mtime + content hash + * match what's already in the DB. Deletes documents whose file is + * gone from disk. + * + * @param {{ db: NotesDB, root: string, embedder: import('./embeddings.mjs').Embedder, logger?: (s: string) => void, chunkOpts?: { targetTokens?: number, overlapTokens?: number } }} args + * @returns {Promise<IngestStats>} + */ +export async function refresh(args) { + return ingestImpl({ ...args, mode: 'incremental' }); +} + +/** + * @param {{ db: NotesDB, root: string, embedder: import('./embeddings.mjs').Embedder, logger?: (s: string) => void, chunkOpts?: { targetTokens?: number, overlapTokens?: number }, mode: 'full' | 'incremental' }} args + * @returns {Promise<IngestStats>} + */ +async function ingestImpl({ db, root, embedder, logger, chunkOpts, mode }) { + const log = logger ?? (() => {}); + const t0 = Date.now(); + const target = chunkOpts?.targetTokens ?? DEFAULT_CHUNK_TOKENS; + const overlap = chunkOpts?.overlapTokens ?? 
DEFAULT_CHUNK_OVERLAP; + + const files = await findMarkdownFiles(root); + if (files.length === 0) { + log(`no markdown files found under ${root}`); + return { files: 0, chunks: 0, skipped: 0, deleted: 0, elapsedMs: 0 }; + } + + // ---------------------------------------------------------------- + // PHASE 1 — plan. Read the current DB state, hash each on-disk + // file, build the change set. No writes yet. + const existing = db.listDocuments(); + /** @type {Array<{ relPath: string, abs: string, mtime: number, text: string, hash: string, priorId: number | null }>} */ + const planUpserts = []; + /** @type {number[]} */ + const planDeletes = []; + let skipped = 0; + const seenPaths = new Set(); + + for (const abs of files) { + const rel = relative(root, abs); + const text = await readFile(abs, 'utf8'); + const fstat = await stat(abs); + const mtime = Math.floor(fstat.mtimeMs / 1000); + const hash = sha256Hex(text); + seenPaths.add(rel); + const prior = existing.get(rel); + + if (mode === 'incremental' && prior && prior.mtime === mtime && prior.contentHash === hash) { + skipped++; + continue; + } + planUpserts.push({ + relPath: rel, + abs, + mtime, + text, + hash, + priorId: prior?.id ?? null, + }); + } + // Files that vanished from disk — only when refreshing. + if (mode === 'incremental') { + for (const [path, prior] of existing) { + if (!seenPaths.has(path)) planDeletes.push(prior.id); + } + } + // Full ingest implicitly replaces every existing doc that we're + // re-ingesting. Drop docs no longer present on disk too, so a + // re-run of `init` against a different source dir doesn't leave + // orphans behind. + if (mode === 'full') { + for (const [path, prior] of existing) { + if (!seenPaths.has(path)) planDeletes.push(prior.id); + } + } + + // Embed BEFORE touching the DB. If anything throws here (e.g. a + // network embedding call fails) we haven't mutated anything. 
+ /** @type {Array<{ plan: typeof planUpserts[number], title: string, body: string, chunks: Array<{ ord: number, content: string, embedding: number[] }> }>} */ + const embedded = []; + let totalEmbedded = 0; + for (const p of planUpserts) { + const { frontmatter, body } = stripFrontmatter(p.text); + const title = deriveTitle({ + frontmatter, + body, + fallback: basename(p.abs, extname(p.abs)), + }); + const chunks = chunkMarkdown(body, { targetTokens: target, overlapTokens: overlap }); + if (chunks.length === 0) { + log(`skipped empty: ${p.relPath}`); + continue; + } + const embeds = []; + for (const c of chunks) { + const v = await embedder.embed(c.content); + embeds.push({ ord: c.ord, content: c.content, embedding: v }); + totalEmbedded++; + } + embedded.push({ plan: p, title, body, chunks: embeds }); + if (embedded.length % 10 === 0) { + log(` embedded ${embedded.length}/${planUpserts.length} files (${totalEmbedded} chunks)…`); + } + } + + const hasMutations = planDeletes.length > 0 || embedded.some((e) => e.plan.priorId !== null); + + // ---------------------------------------------------------------- + // PHASE 2 — deletes (and replacing-deletes). + // + // The engine's HNSW index has a bug where rows deleted and re- + // inserted within the same connection lifetime can corrupt the + // index's stored vectors (see ../README.md "Known limitations"). + // Closing + reopening the connection between the delete-pass and + // the insert-pass forces a full index rebuild on next open, + // sidestepping the issue. We only pay this cost when there's + // actually something to delete; pure-INSERT runs (first `init`) + // skip this hop entirely. + if (hasMutations) { + db.transaction(() => { + for (const id of planDeletes) db.deleteDocument(id); + for (const e of embedded) { + if (e.plan.priorId !== null) db.deleteDocument(e.plan.priorId); + } + }); + db.reopen(); + } + + // ---------------------------------------------------------------- + // PHASE 3 — inserts. 
+ let totalChunks = 0; + for (const e of embedded) { + db.transaction(() => { + const { id } = db.upsertDocument({ + path: e.plan.relPath, + title: e.title, + mtime: e.plan.mtime, + content: e.body, + contentHash: e.plan.hash, + }); + for (const c of e.chunks) { + db.insertChunk({ + documentId: id, + ord: c.ord, + content: c.content, + embedding: c.embedding, + }); + } + }); + totalChunks += e.chunks.length; + } + + return { + files: embedded.length, + chunks: totalChunks, + skipped, + deleted: planDeletes.length, + elapsedMs: Date.now() - t0, + }; +} + +function sha256Hex(input) { + return createHash('sha256').update(input).digest('hex'); +} diff --git a/examples/nodejs-notes/src/search.mjs b/examples/nodejs-notes/src/search.mjs new file mode 100644 index 0000000..650acf1 --- /dev/null +++ b/examples/nodejs-notes/src/search.mjs @@ -0,0 +1,51 @@ +// Hybrid retrieval driver for the `search` debug command. +// +// Same shape an LLM would get over MCP through `vector_search` + +// `bm25_search`, but with rendered output for humans. + +/** + * @param {{ db: import('./db.mjs').NotesDB, embedder: import('./embeddings.mjs').Embedder, query: string, k?: number, weight?: number }} args + */ +export async function search({ db, embedder, query, k = 5, weight = 0.5 }) { + const embedding = await embedder.embed(query); + return db.hybridSearch({ query, embedding, k, weight }); +} + +/** + * Render a list of search results as a human-friendly string. + * + * @param {string} query + * @param {ReturnType } hits + */ +export function renderResults(query, hits) { + if (hits.length === 0) { + return `no results for: ${JSON.stringify(query)}\n`; + } + const lines = []; + lines.push(`top ${hits.length} hits for: ${JSON.stringify(query)}`); + lines.push(''); + for (let i = 0; i < hits.length; i++) { + const h = hits[i]; + const head = h.title ? `${h.title} — ${h.path}` : h.path; + lines.push(`${pad(i + 1)}. 
${head} (chunk ${h.ord})`); + lines.push(indent(truncate(h.content, 280))); + lines.push(''); + } + return lines.join('\n'); +} + +function pad(n) { + return String(n).padStart(2, ' '); +} + +function indent(text) { + return text + .split(/\r?\n/) + .map((l) => ` ${l}`) + .join('\n'); +} + +function truncate(text, max) { + const t = text.replace(/\s+/g, ' ').trim(); + return t.length <= max ? t : `${t.slice(0, max - 1)}…`; +} diff --git a/examples/nodejs-notes/src/serve.mjs b/examples/nodejs-notes/src/serve.mjs new file mode 100644 index 0000000..43247f9 --- /dev/null +++ b/examples/nodejs-notes/src/serve.mjs @@ -0,0 +1,113 @@ +// `serve` — spawn `sqlrite-mcp --read-only ` with stdio inherited. +// +// The point of this command is to remove the "find the binary, then +// write the right `args` array" step from Claude Desktop config: +// users wire ONE thing (`sqlrite-notes serve`) and never have to +// know where `sqlrite-mcp` lives. The MCP client speaks JSON-RPC +// over our stdio; we just shovel it to/from the child. + +import { spawn } from 'node:child_process'; +import { existsSync } from 'node:fs'; +import { join } from 'node:path'; +import { homedir } from 'node:os'; + +/** + * Try a sequence of well-known locations to find a `sqlrite-mcp` + * binary. Order: + * + * 1. `SQLRITE_MCP_BIN` env var (explicit override). + * 2. `which sqlrite-mcp` via `PATH`. + * 3. `~/.cargo/bin/sqlrite-mcp` (cargo install default). + * + * @returns {string | null} + */ +export function locateMcpBinary() { + const env = process.env.SQLRITE_MCP_BIN; + if (env) { + if (!existsSync(env)) { + throw new Error( + `SQLRITE_MCP_BIN=${env} is set but the file doesn't exist.`, + ); + } + return env; + } + + // PATH lookup. `process.env.PATH` is the only thing we can portably + // check without shelling out; spawning `which` adds latency for no + // benefit since `spawn(name)` will already use PATH on Unix. + const pathDirs = (process.env.PATH ?? '').split(process.platform === 'win32' ? 
';' : ':'); + const exeName = process.platform === 'win32' ? 'sqlrite-mcp.exe' : 'sqlrite-mcp'; + for (const dir of pathDirs) { + if (!dir) continue; + const candidate = join(dir, exeName); + if (existsSync(candidate)) return candidate; + } + + // Cargo install fallback. + const cargoBin = join(homedir(), '.cargo', 'bin', exeName); + if (existsSync(cargoBin)) return cargoBin; + + return null; +} + +/** + * Spawn `sqlrite-mcp --read-only ` with stdio inherited. Returns + * a Promise that resolves with the child's exit code. + * + * @param {{ dbPath: string, extraArgs?: string[], stderr?: NodeJS.WritableStream }} args + * @returns {Promise<number>} + */ +export function spawnMcpServer({ dbPath, extraArgs = [], stderr }) { + const bin = locateMcpBinary(); + if (!bin) { + throw new Error( + 'sqlrite-mcp binary not found.\n' + + '\n' + + 'Install it one of these ways:\n' + + ' cargo install sqlrite-mcp\n' + + ' # or download from https://github.com/joaoh82/rust_sqlite/releases\n' + + '\n' + + 'You can also override the lookup with SQLRITE_MCP_BIN=/path/to/sqlrite-mcp.\n', + ); + } + + // Build args. `--read-only` is the whole reason this wrapper exists: + // we never want Claude (or any other MCP client) to mutate the notes + // DB out from under the ingest pipeline. + const args = [dbPath, '--read-only', ...extraArgs]; + + return new Promise((resolve, reject) => { + const child = spawn(bin, args, { + // stdin / stdout MUST be inherited so the MCP client can talk to + // the child directly. stderr we pipe to wherever the caller asks + // (default: our own stderr). + stdio: ['inherit', 'inherit', stderr ? 'pipe' : 'inherit'], + env: process.env, + }); + if (stderr && child.stderr) { + child.stderr.pipe(stderr); + } + child.on('error', reject); + child.on('exit', (code, signal) => { + if (signal) { + // Propagate the signal as a non-zero exit code so Claude + // Desktop sees the failure cleanly. + resolve(128 + (signalToNumber(signal) ?? 1)); + } else { + resolve(code ?? 
0); + } + }); + // Forward SIGINT / SIGTERM to the child so Ctrl-C in the parent + // shuts the child down rather than orphaning it. + const forward = (sig) => { + if (!child.killed) child.kill(sig); + }; + process.once('SIGINT', () => forward('SIGINT')); + process.once('SIGTERM', () => forward('SIGTERM')); + }); +} + +function signalToNumber(sig) { + const map = { SIGINT: 2, SIGTERM: 15, SIGKILL: 9, SIGHUP: 1 }; + return map[sig]; +} diff --git a/examples/nodejs-notes/src/sqlutil.mjs b/examples/nodejs-notes/src/sqlutil.mjs new file mode 100644 index 0000000..33aa1ba --- /dev/null +++ b/examples/nodejs-notes/src/sqlutil.mjs @@ -0,0 +1,63 @@ +// Tiny SQL-literal helpers — the SQLRite engine doesn't support +// `?`-style parameter binding yet (Phase 5a.2 follow-up), so every +// caller must inline values as SQL literals. This module is the +// single place that does that safely. +// +// Mirrors the shape of `sqlrite_agent.sqlutil` in the Python example. + +/** + * Quote a JavaScript value as a SQL literal. + * + * - string → `'escaped'` (single quotes doubled per the SQL standard) + * - number → integer or `Number.prototype.toString()` for finite floats + * - boolean → `TRUE` / `FALSE` + * - null/undefined → `NULL` + * - number[] → `[v1, v2, ...]` — the engine's vector literal syntax + * + * Anything else throws — refuse to silently `String()` an object. + * + * @param {unknown} value + * @returns {string} + */ +export function q(value) { + if (value === null || value === undefined) return 'NULL'; + if (typeof value === 'string') return `'${value.replaceAll("'", "''")}'`; + if (typeof value === 'number') { + if (!Number.isFinite(value)) { + throw new TypeError(`q(): non-finite number ${value}`); + } + return Number.isInteger(value) ? String(value) : value.toString(); + } + if (typeof value === 'bigint') return value.toString(); + if (typeof value === 'boolean') return value ? 
'TRUE' : 'FALSE'; + if (Array.isArray(value)) { + // Vector literal — every element must be finite numeric. + const parts = value.map((v, i) => { + if (typeof v !== 'number' || !Number.isFinite(v)) { + throw new TypeError(`q(): vector element ${i} is not a finite number (got ${v})`); + } + // toString() emits the shortest round-trippable form; the + // engine's parser accepts both fixed-point and exponential. + return v.toString(); + }); + return `[${parts.join(', ')}]`; + } + throw new TypeError(`q(): unsupported value type ${typeof value}`); +} + +/** + * Validate a SQL identifier (table / column / index name) against the + * unquoted-identifier subset the engine accepts. Throws if invalid. + * + * Use this for ANY identifier that ultimately gets inlined into SQL — + * callers shouldn't have to guess what's safe. + * + * @param {string} name + * @returns {string} the same name (for chaining) + */ +export function ident(name) { + if (typeof name !== 'string' || !/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) { + throw new TypeError(`ident(): invalid SQL identifier ${JSON.stringify(name)}`); + } + return name; +} diff --git a/examples/nodejs-notes/test/chunker.test.mjs b/examples/nodejs-notes/test/chunker.test.mjs new file mode 100644 index 0000000..d6faac2 --- /dev/null +++ b/examples/nodejs-notes/test/chunker.test.mjs @@ -0,0 +1,85 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; + +import { + stripFrontmatter, + deriveTitle, + chunkMarkdown, + approxTokens, +} from '../src/chunker.mjs'; + +test('stripFrontmatter — YAML between --- fences', () => { + const text = '---\ntitle: Foo\ntags: [a]\n---\n\nBody.'; + const { frontmatter, body } = stripFrontmatter(text); + assert.match(frontmatter, /^title: Foo/); + assert.equal(body.trim(), 'Body.'); +}); + +test('stripFrontmatter — no frontmatter passes through', () => { + const text = '# Heading\n\nBody.'; + const { frontmatter, body } = stripFrontmatter(text); + assert.equal(frontmatter, ''); + 
assert.equal(body, text); +}); + +test('deriveTitle — picks frontmatter title first', () => { + assert.equal( + deriveTitle({ frontmatter: 'title: Hello World', body: '# Other', fallback: 'fb' }), + 'Hello World', + ); +}); + +test('deriveTitle — falls back to first heading', () => { + assert.equal( + deriveTitle({ frontmatter: '', body: '# My Heading\n\nbody.', fallback: 'fb' }), + 'My Heading', + ); +}); + +test('deriveTitle — falls back to filename stem', () => { + assert.equal(deriveTitle({ frontmatter: '', body: 'no heading', fallback: 'fb' }), 'fb'); +}); + +test('approxTokens — whitespace word count', () => { + assert.equal(approxTokens(''), 0); + assert.equal(approxTokens('one two three'), 3); + assert.equal(approxTokens(' a b '), 2); +}); + +test('chunkMarkdown — single short doc fits in one chunk', () => { + const out = chunkMarkdown('# Title\n\nA short paragraph.\n'); + assert.equal(out.length, 1); + assert.match(out[0].content, /Title/); + assert.match(out[0].content, /short paragraph/); +}); + +test('chunkMarkdown — long doc splits into multiple chunks', () => { + const big = Array.from({ length: 20 }, (_, i) => `Paragraph ${i}: ${'word '.repeat(60)}`).join( + '\n\n', + ); + const out = chunkMarkdown(`# Heading\n\n${big}`, { targetTokens: 200, overlapTokens: 20 }); + assert.ok(out.length > 1, `expected >1 chunk, got ${out.length}`); + // Each chunk should be non-empty. + for (const c of out) { + assert.ok(c.content.length > 0); + } +}); + +test('chunkMarkdown — overlap copies tail tokens forward', () => { + const body = + '# A\n\n' + + Array.from({ length: 5 }, (_, i) => `alpha${i} ${'lorem '.repeat(80)}`).join('\n\n'); + const out = chunkMarkdown(body, { targetTokens: 100, overlapTokens: 30 }); + assert.ok(out.length >= 2); + // The second chunk should contain the tail of the first. 
+ const firstTail = out[0].content.split(/\s+/).slice(-20).join(' '); + assert.ok( + out[1].content.includes(firstTail.split(' ').slice(-5).join(' ')), + 'second chunk should overlap with the tail of the first', + ); +}); + +test('chunkMarkdown — empty body produces no chunks', () => { + assert.deepEqual(chunkMarkdown(''), []); + assert.deepEqual(chunkMarkdown(' \n\n '), []); +}); diff --git a/examples/nodejs-notes/test/claude-config.test.mjs b/examples/nodejs-notes/test/claude-config.test.mjs new file mode 100644 index 0000000..d5e7217 --- /dev/null +++ b/examples/nodejs-notes/test/claude-config.test.mjs @@ -0,0 +1,36 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; + +import { buildConfig, renderInstructions } from '../src/claude-config.mjs'; + +test('buildConfig — default command is "sqlrite-notes"', () => { + const cfg = buildConfig({ dbPath: '/tmp/notes.sqlrite' }); + assert.deepEqual(cfg, { + mcpServers: { + 'sqlrite-notes': { + command: 'sqlrite-notes', + args: ['serve', '--db', '/tmp/notes.sqlrite'], + }, + }, + }); +}); + +test('buildConfig — explicit binPath wins', () => { + const cfg = buildConfig({ + dbPath: '/tmp/notes.sqlrite', + binPath: '/opt/sqlrite-notes/bin/sqlrite-notes.mjs', + }); + assert.equal( + cfg.mcpServers['sqlrite-notes'].command, + '/opt/sqlrite-notes/bin/sqlrite-notes.mjs', + ); +}); + +test('renderInstructions — embeds JSON block and Claude Desktop path', () => { + const out = renderInstructions({ dbPath: '/tmp/notes.sqlrite' }); + assert.match(out, /mcpServers/); + assert.match(out, /sqlrite-notes/); + assert.match(out, /"command": "sqlrite-notes"/); + assert.match(out, /serve/); + assert.match(out, /modelcontextprotocol/); // inspector hint +}); diff --git a/examples/nodejs-notes/test/db.test.mjs b/examples/nodejs-notes/test/db.test.mjs new file mode 100644 index 0000000..210dddf --- /dev/null +++ b/examples/nodejs-notes/test/db.test.mjs @@ -0,0 +1,272 @@ +// Integration tests against a real SQLRite 
Connection. These require +// the @joaoh82/sqlrite Node binding to be built/installed; if it +// isn't, the suite emits a skip notice instead of failing. + +import test from 'node:test'; +import assert from 'node:assert/strict'; +import { mkdtempSync, rmSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; + +let NotesDB; +let skipReason = null; +try { + ({ NotesDB } = await import('../src/db.mjs')); +} catch (err) { + skipReason = `cannot import db.mjs (build the Node SDK first?): ${err.message}`; +} + +// Node 24's test runner treats `{ skip: null }` as a skip directive +// (the key's presence matters more than its value), so use this +// helper to conditionally pass the option only when we genuinely +// want to skip. +const maybeSkip = skipReason ? { skip: skipReason } : {}; + +function withDb(fn) { + const dir = mkdtempSync(join(tmpdir(), 'sqlrite-notes-test-')); + const path = join(dir, 'notes.sqlrite'); + try { + return fn({ dir, path }); + } finally { + rmSync(dir, { recursive: true, force: true }); + } +} + +test('db: schema applies cleanly + stats start at zero', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 16 }); + try { + const s = db.stats(); + assert.equal(s.documents, 0); + assert.equal(s.chunks, 0); + assert.equal(s.dim, 16); + } finally { + db.close(); + } + }); +}); + +test('db: upsertDocument + insertChunk round-trip', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const { id, replaced } = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 100, + content: 'rust embedded database notes', + contentHash: 'h1', + }); + assert.ok(id > 0); + assert.equal(replaced, false); + + db.insertChunk({ + documentId: id, + ord: 0, + content: 'rust embedded database notes', + embedding: [1, 0, 0, 0], + }); + + const s = db.stats(); + assert.equal(s.documents, 1); + assert.equal(s.chunks, 1); + } finally { + db.close(); + } + }); 
+}); + +test('db: upsertDocument replaces prior version + cascades chunks', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const v1 = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 100, + content: 'one', + contentHash: 'h1', + }); + db.insertChunk({ + documentId: v1.id, + ord: 0, + content: 'one', + embedding: [1, 0, 0, 0], + }); + + const v2 = db.upsertDocument({ + path: 'a.md', + title: 'A v2', + mtime: 200, + content: 'two', + contentHash: 'h2', + }); + assert.equal(v2.replaced, true); + assert.notEqual(v2.id, v1.id); + + const s = db.stats(); + assert.equal(s.documents, 1); + assert.equal(s.chunks, 0); // old chunk got dropped on replace + } finally { + db.close(); + } + }); +}); + +test( + 'db: hybridSearch returns vector + BM25 hits in a sensible order', + maybeSkip, + () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const { id: dA } = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 1, + content: 'rust embedded database', + contentHash: 'h1', + }); + const { id: dB } = db.upsertDocument({ + path: 'b.md', + title: 'B', + mtime: 2, + content: 'distributed systems and consensus protocols', + contentHash: 'h2', + }); + // One chunk per document, with distinct embeddings. + db.insertChunk({ + documentId: dA, + ord: 0, + content: 'rust embedded database', + embedding: [1, 0, 0, 0], + }); + db.insertChunk({ + documentId: dB, + ord: 0, + content: 'distributed systems and consensus protocols', + embedding: [0, 1, 0, 0], + }); + + // A query whose embedding aligns with chunk A and whose + // tokens overlap chunk A — A should win. 
+ const hits = db.hybridSearch({ + query: 'rust database', + embedding: [1, 0, 0, 0], + k: 2, + }); + assert.ok(hits.length >= 1); + assert.equal(hits[0].path, 'a.md'); + } finally { + db.close(); + } + }); + }, +); + +test( + 'db: hybridSearch falls back to vector-only when FTS tokens are empty', + maybeSkip, + () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const { id } = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 1, + content: 'rust embedded database', + contentHash: 'h1', + }); + db.insertChunk({ + documentId: id, + ord: 0, + content: 'rust embedded database', + embedding: [1, 0, 0, 0], + }); + const hits = db.hybridSearch({ + query: '日本語', // every byte non-ASCII → no FTS tokens + embedding: [1, 0, 0, 0], + k: 5, + }); + assert.equal(hits.length, 1); + assert.equal(hits[0].path, 'a.md'); + } finally { + db.close(); + } + }); + }, +); + +test('db: deleteDocument cascades to chunks', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + const { id } = db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 1, + content: 'x', + contentHash: 'h', + }); + db.insertChunk({ + documentId: id, + ord: 0, + content: 'x', + embedding: [1, 0, 0, 0], + }); + db.deleteDocument(id); + const s = db.stats(); + assert.equal(s.documents, 0); + assert.equal(s.chunks, 0); + } finally { + db.close(); + } + }); +}); + +test('db: listDocuments → path map', maybeSkip, () => { + withDb(({ path }) => { + const db = new NotesDB(path, { dim: 4 }); + try { + db.upsertDocument({ + path: 'a.md', + title: 'A', + mtime: 100, + content: 'x', + contentHash: 'h1', + }); + db.upsertDocument({ + path: 'b.md', + title: 'B', + mtime: 200, + content: 'y', + contentHash: 'h2', + }); + const map = db.listDocuments(); + assert.equal(map.size, 2); + assert.equal(map.get('a.md')?.mtime, 100); + assert.equal(map.get('b.md')?.contentHash, 'h2'); + } finally { + db.close(); + } + }); +}); + +test('db: re-open 
path-backed DB reads back data', maybeSkip, () => { + withDb(({ path }) => { + const a = new NotesDB(path, { dim: 4 }); + a.upsertDocument({ path: 'a.md', title: 'A', mtime: 1, content: 'x', contentHash: 'h' }); + a.close(); + + const b = new NotesDB(path, { dim: 4 }); + try { + const s = b.stats(); + assert.equal(s.documents, 1); + const map = b.listDocuments(); + assert.equal(map.get('a.md')?.mtime, 1); + } finally { + b.close(); + } + }); +}); diff --git a/examples/nodejs-notes/test/embeddings.test.mjs b/examples/nodejs-notes/test/embeddings.test.mjs new file mode 100644 index 0000000..7612123 --- /dev/null +++ b/examples/nodejs-notes/test/embeddings.test.mjs @@ -0,0 +1,104 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; + +import { makeHashEmbedder, makeOpenAIEmbedder, makeEmbedder } from '../src/embeddings.mjs'; + +test('hash embedder — deterministic, unit-norm, requested dim', async () => { + const emb = makeHashEmbedder(384); + const a = await emb.embed('rust embedded database'); + const b = await emb.embed('rust embedded database'); + assert.equal(a.length, 384); + assert.deepEqual(a, b); + const norm = Math.sqrt(a.reduce((s, x) => s + x * x, 0)); + assert.ok(Math.abs(norm - 1) < 1e-9 || norm === 0, `unit norm, got ${norm}`); +}); + +test('hash embedder — different inputs produce different vectors', async () => { + const emb = makeHashEmbedder(64); + const a = await emb.embed('alpha beta gamma'); + const b = await emb.embed('delta epsilon zeta'); + assert.notDeepEqual(a, b); +}); + +test('hash embedder — empty text returns zero vector', async () => { + const emb = makeHashEmbedder(32); + const a = await emb.embed(''); + assert.equal(a.length, 32); + assert.ok(a.every((x) => x === 0)); +}); + +test('makeEmbedder defaults to hash', () => { + const emb = makeEmbedder({ dim: 16 }); + assert.equal(emb.name, 'hash'); + assert.equal(emb.dim, 16); +}); + +test('makeEmbedder openai without API key throws clear error', () => { + const prev 
= process.env.OPENAI_API_KEY; + delete process.env.OPENAI_API_KEY; + try { + assert.throws( + () => makeEmbedder({ kind: 'openai', dim: 384 }), + /OPENAI_API_KEY/, + ); + } finally { + if (prev !== undefined) process.env.OPENAI_API_KEY = prev; + } +}); + +test('makeEmbedder unknown kind throws', () => { + assert.throws(() => makeEmbedder({ kind: 'word2vec', dim: 8 }), /unknown embedder/); +}); + +test('openai embedder talks to a mocked fetch and validates shape', async () => { + const calls = []; + const fakeFetch = async (url, init) => { + calls.push({ url, init }); + return new Response( + JSON.stringify({ data: [{ embedding: new Array(8).fill(0.5) }] }), + { status: 200, headers: { 'content-type': 'application/json' } }, + ); + }; + const emb = makeOpenAIEmbedder({ + apiKey: 'sk-test', + model: 'text-embedding-3-small', + dim: 8, + fetchFn: fakeFetch, + }); + const v = await emb.embed('hello world'); + assert.equal(v.length, 8); + assert.equal(calls.length, 1); + assert.equal(calls[0].url, 'https://api.openai.com/v1/embeddings'); + const body = JSON.parse(calls[0].init.body); + assert.equal(body.model, 'text-embedding-3-small'); + assert.equal(body.dimensions, 8); + assert.equal(body.input, 'hello world'); + assert.match(calls[0].init.headers.authorization, /^Bearer /); +}); + +test('openai embedder surfaces API errors', async () => { + const fakeFetch = async () => + new Response('rate limited', { status: 429 }); + const emb = makeOpenAIEmbedder({ + apiKey: 'sk-test', + model: 'm', + dim: 4, + fetchFn: fakeFetch, + }); + await assert.rejects(emb.embed('x'), /OpenAI embeddings API error 429/); +}); + +test('openai embedder rejects wrong-dim responses', async () => { + const fakeFetch = async () => + new Response(JSON.stringify({ data: [{ embedding: [1, 2, 3] }] }), { + status: 200, + headers: { 'content-type': 'application/json' }, + }); + const emb = makeOpenAIEmbedder({ + apiKey: 'sk-test', + model: 'm', + dim: 8, + fetchFn: fakeFetch, + }); + await 
assert.rejects(emb.embed('x'), /returned 3 dims, expected 8/); +}); diff --git a/examples/nodejs-notes/test/fixtures/crdts.md b/examples/nodejs-notes/test/fixtures/crdts.md new file mode 100644 index 0000000..72b5a89 --- /dev/null +++ b/examples/nodejs-notes/test/fixtures/crdts.md @@ -0,0 +1,23 @@ +--- +title: CRDTs for collaborative editing +tags: [crdt, distributed-systems] +--- + +# CRDTs for collaborative editing + +Conflict-free replicated data types let two clients edit the same +document offline and merge the result deterministically. Two flavors +matter: state-based (CvRDTs) and operation-based (CmRDTs). + +## When to reach for one + +If your network is unreliable but you can ship every state mutation +through a message broker, CmRDTs win: smaller payloads, fewer wasted +bytes. + +## What still bites you + +Causality tracking — vector clocks specifically — grows with the +number of replicas. Yjs and Automerge invest a lot of effort in +compressing those metadata structures so they don't dominate the on- +wire payload for long-lived documents. diff --git a/examples/nodejs-notes/test/fixtures/postgres.md b/examples/nodejs-notes/test/fixtures/postgres.md new file mode 100644 index 0000000..76bfd8e --- /dev/null +++ b/examples/nodejs-notes/test/fixtures/postgres.md @@ -0,0 +1,18 @@ +# Notes on Postgres + +Postgres is a relational database server with extension hooks for +storage formats and access methods. The reason it's the default +SQL engine for new projects is the combination of MVCC, +PL/pgSQL, and a permissive license. + +## What I keep forgetting + +Subtransactions are cheap up to a point, then VERY expensive — the +SLRU buffers become the bottleneck. If you find yourself with +nested savepoints in a hot path, audit them. + +## Replication + +Streaming replication via WAL shipping is the default. Logical +replication via decoded WAL records is more flexible but the +publication / subscription dance has more moving parts. 
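Reviewer note: the hash-embedder tests in `embeddings.test.mjs` above pin down a contract: the same text always yields the same vector, the vector has the requested dimension, it is unit-norm (or all zeros), and empty text gives a zero vector. The implementation in `src/embeddings.mjs` is not part of this section of the diff, so the following is only a minimal sketch of one embedder that would satisfy that contract. It assumes signed feature hashing with FNV-1a; the names `fnv1a`, `hashEmbed`, and `makeSketchHashEmbedder` are hypothetical and are not taken from the project.

```javascript
// Hypothetical sketch only; NOT the project's src/embeddings.mjs.

// 32-bit FNV-1a hash over the string's UTF-16 code units.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits unsigned
  }
  return h >>> 0;
}

// Bag-of-words feature hashing: each token adds +1 or -1 to one bucket,
// then the vector is scaled to unit length. Deterministic by construction.
function hashEmbed(text, dim) {
  if (!Number.isInteger(dim) || dim <= 0) {
    throw new TypeError('dim must be a positive integer');
  }
  const vec = new Array(dim).fill(0);
  const tokens = String(text).toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
  for (const tok of tokens) {
    const h = fnv1a(tok);
    const sign = (h >>> 31) === 1 ? -1 : 1; // top bit picks the sign
    vec[h % dim] += sign;
  }
  const norm = Math.sqrt(vec.reduce((s, x) => s + x * x, 0));
  // Empty text (or fully cancelling tokens) leaves a zero vector untouched.
  return norm === 0 ? vec : vec.map((x) => x / norm);
}

// Async wrapper shaped like the object the tests exercise (name, dim, embed).
function makeSketchHashEmbedder(dim) {
  return {
    name: 'hash',
    dim,
    async embed(text) {
      return hashEmbed(text, dim);
    },
  };
}
```

Under these assumptions, `await makeSketchHashEmbedder(384).embed('rust embedded database')` returns a 384-element array, and repeated calls return identical values. Signed hashing can in principle cancel to an all-zero vector for non-empty text, which is presumably why the test above accepts `norm === 0` as well as unit norm.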
diff --git a/examples/nodejs-notes/test/fixtures/running.md b/examples/nodejs-notes/test/fixtures/running.md new file mode 100644 index 0000000..2729fcc --- /dev/null +++ b/examples/nodejs-notes/test/fixtures/running.md @@ -0,0 +1,8 @@ +# Marathon training journal + +Week 1: did a 10-mile long run on Sunday. Tempo run on Wednesday felt +flat — probably under-fuelled. Keep an eye on carb intake the day +before a quality session. + +Week 2: hill repeats on Tuesday went well. The 16-mile long run was +the first one with no GI issues since switching gels. diff --git a/examples/nodejs-notes/test/ingest.test.mjs b/examples/nodejs-notes/test/ingest.test.mjs new file mode 100644 index 0000000..c356477 --- /dev/null +++ b/examples/nodejs-notes/test/ingest.test.mjs @@ -0,0 +1,131 @@ +// End-to-end ingest + search test against the test/fixtures notes. +// Skips cleanly if the @joaoh82/sqlrite Node binding isn't built. + +import test from 'node:test'; +import assert from 'node:assert/strict'; +import { + mkdtempSync, + mkdirSync, + rmSync, + writeFileSync, + utimesSync, + unlinkSync, +} from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join, dirname } from 'node:path'; +import { fileURLToPath } from 'node:url'; + +import { makeHashEmbedder } from '../src/embeddings.mjs'; + +let NotesDB, ingest, refresh, search; +let skipReason = null; +try { + ({ NotesDB } = await import('../src/db.mjs')); + ({ ingest, refresh } = await import('../src/ingest.mjs')); + ({ search } = await import('../src/search.mjs')); +} catch (err) { + skipReason = `cannot import (build the Node SDK first?): ${err.message}`; +} + +const maybeSkip = skipReason ? 
{ skip: skipReason } : {}; + +const here = dirname(fileURLToPath(import.meta.url)); +const fixturesDir = join(here, 'fixtures'); + +test( + 'ingest fixtures → search recalls the right note for each query', + maybeSkip, + async () => { + const dir = mkdtempSync(join(tmpdir(), 'sqlrite-notes-itest-')); + const path = join(dir, 'notes.sqlrite'); + try { + const embedder = makeHashEmbedder(64); + const db = new NotesDB(path, { dim: embedder.dim }); + try { + const stats = await ingest({ db, root: fixturesDir, embedder }); + assert.ok(stats.files >= 3, `expected ≥3 files, got ${stats.files}`); + assert.ok(stats.chunks >= 3, `expected ≥3 chunks, got ${stats.chunks}`); + + const crdtHits = await search({ + db, + embedder, + query: 'collaborative editing CRDT', + k: 3, + }); + assert.ok(crdtHits.length > 0); + assert.equal(crdtHits[0].path, 'crdts.md'); + + const pgHits = await search({ + db, + embedder, + query: 'WAL replication', + k: 3, + }); + assert.ok(pgHits.length > 0); + assert.equal(pgHits[0].path, 'postgres.md'); + + const runHits = await search({ + db, + embedder, + query: 'marathon training long run', + k: 3, + }); + assert.ok(runHits.length > 0); + assert.equal(runHits[0].path, 'running.md'); + } finally { + db.close(); + } + } finally { + rmSync(dir, { recursive: true, force: true }); + } + }, +); + +test( + 'refresh: unchanged files are skipped; changed files re-embedded; deleted removed', + maybeSkip, + async () => { + const dir = mkdtempSync(join(tmpdir(), 'sqlrite-notes-itest-')); + const sourceDir = join(dir, 'notes'); + const dbPath = join(dir, 'notes.sqlrite'); + try { + mkdirSync(sourceDir, { recursive: true }); + writeFileSync(join(sourceDir, 'keep.md'), '# Keep\n\nshould stay verbatim.\n'); + writeFileSync( + join(sourceDir, 'change.md'), + '# Change\n\noriginal body about postgres.\n', + ); + writeFileSync(join(sourceDir, 'remove.md'), '# Remove\n\nwill be deleted later.\n'); + + const embedder = makeHashEmbedder(32); + const db = new 
NotesDB(dbPath, { dim: embedder.dim }); + try { + const first = await ingest({ db, root: sourceDir, embedder }); + assert.equal(first.files, 3); + + writeFileSync( + join(sourceDir, 'change.md'), + '# Change\n\nrewritten body about distributed systems.\n', + ); + const futureSec = Math.floor(Date.now() / 1000) + 5; + utimesSync(join(sourceDir, 'change.md'), futureSec, futureSec); + unlinkSync(join(sourceDir, 'remove.md')); + + const second = await refresh({ db, root: sourceDir, embedder }); + assert.equal(second.files, 1, 'one file changed'); + assert.equal(second.skipped, 1, 'one file unchanged'); + assert.equal(second.deleted, 1, 'one file removed'); + + const docs = db.listDocuments(); + assert.equal(docs.size, 2); + assert.ok(docs.has('keep.md')); + assert.ok(docs.has('change.md')); + assert.ok(!docs.has('remove.md')); + } finally { + db.close(); + } + } finally { + rmSync(dir, { recursive: true, force: true }); + } + }, +); diff --git a/examples/nodejs-notes/test/serve.test.mjs b/examples/nodejs-notes/test/serve.test.mjs new file mode 100644 index 0000000..2de78b5 --- /dev/null +++ b/examples/nodejs-notes/test/serve.test.mjs @@ -0,0 +1,38 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; +import { writeFileSync, chmodSync, mkdtempSync, rmSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; + +import { locateMcpBinary } from '../src/serve.mjs'; + +test('locateMcpBinary honors SQLRITE_MCP_BIN when the file exists', () => { + const dir = mkdtempSync(join(tmpdir(), 'sqlrite-mcp-bin-test-')); + try { + const fakeBin = join(dir, 'fake-mcp'); + writeFileSync(fakeBin, '#!/bin/sh\necho fake\n'); + chmodSync(fakeBin, 0o755); + + const prev = process.env.SQLRITE_MCP_BIN; + process.env.SQLRITE_MCP_BIN = fakeBin; + try { + assert.equal(locateMcpBinary(), fakeBin); + } finally { + if (prev === undefined) delete process.env.SQLRITE_MCP_BIN; + else process.env.SQLRITE_MCP_BIN = prev; + } + } finally { + 
rmSync(dir, { recursive: true, force: true }); + } +}); + +test('locateMcpBinary throws if SQLRITE_MCP_BIN points at a missing file', () => { + const prev = process.env.SQLRITE_MCP_BIN; + process.env.SQLRITE_MCP_BIN = '/definitely/not/real/sqlrite-mcp'; + try { + assert.throws(() => locateMcpBinary(), /SQLRITE_MCP_BIN/); + } finally { + if (prev === undefined) delete process.env.SQLRITE_MCP_BIN; + else process.env.SQLRITE_MCP_BIN = prev; + } +}); diff --git a/examples/nodejs-notes/test/sqlutil.test.mjs b/examples/nodejs-notes/test/sqlutil.test.mjs new file mode 100644 index 0000000..7e617d0 --- /dev/null +++ b/examples/nodejs-notes/test/sqlutil.test.mjs @@ -0,0 +1,47 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; + +import { q, ident } from '../src/sqlutil.mjs'; + +test('q strings — basic and quote-doubling', () => { + assert.equal(q('hello'), "'hello'"); + assert.equal(q("it's"), "'it''s'"); + assert.equal(q("a'b'c"), "'a''b''c'"); + assert.equal(q(''), "''"); +}); + +test('q numbers — ints, floats, throws on NaN/Inf', () => { + assert.equal(q(0), '0'); + assert.equal(q(42), '42'); + assert.equal(q(-7), '-7'); + assert.equal(q(1.5), '1.5'); + assert.throws(() => q(NaN), TypeError); + assert.throws(() => q(Infinity), TypeError); +}); + +test('q booleans + null', () => { + assert.equal(q(true), 'TRUE'); + assert.equal(q(false), 'FALSE'); + assert.equal(q(null), 'NULL'); + assert.equal(q(undefined), 'NULL'); +}); + +test('q vector — bracket-array literal', () => { + assert.equal(q([0.1, 0.2, 0.3]), '[0.1, 0.2, 0.3]'); + assert.equal(q([]), '[]'); + assert.throws(() => q([0.1, 'x']), TypeError); + assert.throws(() => q([NaN]), TypeError); +}); + +test('q rejects objects', () => { + assert.throws(() => q({}), TypeError); +}); + +test('ident — accepts only the engine\'s unquoted-identifier subset', () => { + assert.equal(ident('users'), 'users'); + assert.equal(ident('_x9'), '_x9'); + assert.throws(() => ident('1users'), TypeError); + 
assert.throws(() => ident('users; DROP TABLE x'), TypeError); + assert.throws(() => ident('hello world'), TypeError); + assert.throws(() => ident(''), TypeError); +}); diff --git a/web/src/app/examples/page.tsx b/web/src/app/examples/page.tsx index 1819a13..bec7f24 100644 --- a/web/src/app/examples/page.tsx +++ b/web/src/app/examples/page.tsx @@ -48,6 +48,18 @@ const itemListJsonLd = { "A CLI chat agent whose long-term memory is a single .sqlrite file. Vector recall via HNSW, lexical recall via BM25, and a structured facts table for deterministic retrieval.", }, }, + { + "@type": "ListItem", + position: 2, + item: { + "@type": "SoftwareSourceCode", + name: "Chat with your notes — Node.js + Claude Desktop MCP", + url: `${SITE.repo}/tree/main/examples/nodejs-notes`, + programmingLanguage: "JavaScript", + description: + "A Node.js CLI that ingests a folder of markdown notes into SQLRite (HNSW + BM25 indexes), then exposes the database to Claude Desktop via sqlrite-mcp --read-only. Hybrid retrieval over your notes from inside the chat client.", + }, + }, ], }; @@ -77,6 +89,22 @@ const EXAMPLES: Example[] = [ repoPath: "examples/python-agent", features: ["HNSW", "VECTOR(384)", "BM25 / FTS", "PyO3 SDK"], }, + { + status: "shipped", + title: "Chat with your notes — Claude Desktop + MCP", + blurb: + "A Node.js CLI that ingests a folder of markdown notes into a SQLRite database, then exposes it to Claude Desktop (or any MCP client) via sqlrite-mcp --read-only. 
Claude calls bm25_search / vector_search / query directly against your local notes — no cloud sync, no custom RAG pipeline.", + bullets: [ + "Markdown → frontmatter-aware chunker → hash or OpenAI embedder → SQLRite documents + chunks tables", + "Hybrid retrieval fuses BM25 and vector cosine in a single SQL ORDER BY (see docs/fts.md)", + "`sqlrite-notes serve` wraps sqlrite-mcp so the Claude Desktop config snippet is one block of JSON", + "Default embedder is fully offline (zero-dep hash bag-of-words); flip to text-embedding-3-small with OPENAI_API_KEY", + "40 unit + integration tests; works against the prebuilt @joaoh82/sqlrite npm binaries", + ], + language: "Node.js 20+", + repoPath: "examples/nodejs-notes", + features: ["HNSW", "BM25 / FTS", "MCP server", "napi-rs SDK"], + }, ]; const pillStyle: React.CSSProperties = { @@ -230,11 +258,10 @@ export default function ExamplesIndexPage() { fontSize: 14, }} > - More examples in flight: a Node.js MCP-powered notes - assistant, a Tauri + Svelte journaling desktop app, a - browser SQL playground (WASM), and a Go edge/IoT event - collector. See /docs for the - engine reference. + More examples in flight: a Tauri + Svelte journaling + desktop app, a browser SQL playground (WASM), and a Go + edge/IoT event collector. See{" "} + /docs for the engine reference.