diff --git a/docs/superpowers/plans/2026-04-09-concept-dedup-and-update.md b/docs/superpowers/plans/2026-04-09-concept-dedup-and-update.md deleted file mode 100644 index 1a312a6..0000000 --- a/docs/superpowers/plans/2026-04-09-concept-dedup-and-update.md +++ /dev/null @@ -1,888 +0,0 @@ -# Concept Dedup & Existing Page Update — Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Give the compiler enough context about existing concepts to make smart dedup/update decisions, and add the ability to rewrite existing concept pages with new information — all without breaking prompt caching. - -**Architecture:** Extend the deterministic pipeline in `compiler.py` with: (1) concept briefs read from disk before the concepts-plan LLM call, (2) a new JSON output format with create/update/related actions, (3) a new concurrent "update" path that sends existing page content to the LLM for rewriting, (4) a code-only "related" path for cross-ref links. Extract shared logic between `compile_short_doc` and `compile_long_doc` into `_compile_concepts`. 
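The create/update/related output format from point (2) can be normalized into three lists before any pages are written. The sketch below is illustrative only: `split_concepts_plan` is a hypothetical helper name, not a function in `compiler.py`. It mirrors the flat-array fallback this plan describes in Task 4; treating any other shape as an empty plan is an added assumption.

```python
import json
from typing import Any


def split_concepts_plan(parsed: Any) -> tuple[list, list, list]:
    """Normalize a parsed concepts plan into (create, update, related).

    Hypothetical helper for illustration; the plan itself inlines this logic.
    """
    # Old format: a flat array of concepts is treated as "create everything".
    if isinstance(parsed, list):
        return parsed, [], []
    # Assumption: anything that is neither a list nor a dict is an empty plan.
    if not isinstance(parsed, dict):
        return [], [], []
    # Missing or null keys become empty lists.
    return (
        list(parsed.get("create") or []),
        list(parsed.get("update") or []),
        list(parsed.get("related") or []),
    )


plan = json.loads(
    '{"create": [{"name": "flash-attention", "title": "Flash Attention"}],'
    ' "update": [{"name": "attention", "title": "Attention"}],'
    ' "related": ["transformer"]}'
)
create, update, related = split_concepts_plan(plan)
```

Only `create` and `update` entries would lead to LLM calls; `related` slugs are handled in code, which is what keeps the cross-reference path cheap.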
- -**Tech Stack:** Python, litellm, asyncio, pytest - ---- - -### Task 1: Add `_read_concept_briefs` and test - -**Files:** -- Modify: `openkb/agent/compiler.py:199-207` (File I/O helpers section) -- Modify: `tests/test_compiler.py:98-116` (TestReadWikiContext section) - -- [ ] **Step 1: Write the failing test** - -Add to `tests/test_compiler.py`: - -```python -from openkb.agent.compiler import _read_concept_briefs - -class TestReadConceptBriefs: - def test_empty_wiki(self, tmp_path): - wiki = tmp_path / "wiki" - (wiki / "concepts").mkdir(parents=True) - assert _read_concept_briefs(wiki) == "(none yet)" - - def test_no_concepts_dir(self, tmp_path): - wiki = tmp_path / "wiki" - wiki.mkdir() - assert _read_concept_briefs(wiki) == "(none yet)" - - def test_reads_briefs_with_frontmatter(self, tmp_path): - wiki = tmp_path / "wiki" - concepts = wiki / "concepts" - concepts.mkdir(parents=True) - (concepts / "attention.md").write_text( - "---\nsources: [paper.pdf]\n---\n\nAttention allows models to focus on relevant input parts selectively.", - encoding="utf-8", - ) - result = _read_concept_briefs(wiki) - assert "- attention: Attention allows models" in result - - def test_reads_briefs_without_frontmatter(self, tmp_path): - wiki = tmp_path / "wiki" - concepts = wiki / "concepts" - concepts.mkdir(parents=True) - (concepts / "rnn.md").write_text( - "Recurrent neural networks process sequences step by step.", - encoding="utf-8", - ) - result = _read_concept_briefs(wiki) - assert "- rnn: Recurrent neural networks" in result - - def test_truncates_long_content(self, tmp_path): - wiki = tmp_path / "wiki" - concepts = wiki / "concepts" - concepts.mkdir(parents=True) - (concepts / "long.md").write_text("A" * 300, encoding="utf-8") - result = _read_concept_briefs(wiki) - brief_line = result.split("\n")[0] - # slug + ": " + 150 chars = well under 200 - assert len(brief_line) < 200 - - def test_sorted_alphabetically(self, tmp_path): - wiki = tmp_path / "wiki" - concepts = wiki / "concepts" - 
concepts.mkdir(parents=True) - (concepts / "zebra.md").write_text("Zebra concept.", encoding="utf-8") - (concepts / "alpha.md").write_text("Alpha concept.", encoding="utf-8") - result = _read_concept_briefs(wiki) - lines = result.strip().split("\n") - assert lines[0].startswith("- alpha:") - assert lines[1].startswith("- zebra:") -``` - -- [ ] **Step 2: Run test to verify it fails** - -Run: `pytest tests/test_compiler.py::TestReadConceptBriefs -v` -Expected: FAIL with `ImportError: cannot import name '_read_concept_briefs'` - -- [ ] **Step 3: Implement `_read_concept_briefs`** - -Add to `openkb/agent/compiler.py` in the File I/O helpers section (after `_read_wiki_context`): - -```python -def _read_concept_briefs(wiki_dir: Path) -> str: - """Read existing concept pages and return compact briefs for the LLM. - - Returns a string like: - - attention: Attention allows models to focus on relevant input parts... - - transformer: The Transformer is a neural network architecture... - - Or "(none yet)" if no concept pages exist. 
- """ - concepts_dir = wiki_dir / "concepts" - if not concepts_dir.exists(): - return "(none yet)" - briefs = [] - for p in sorted(concepts_dir.glob("*.md")): - text = p.read_text(encoding="utf-8") - # Skip YAML frontmatter - if text.startswith("---"): - parts = text.split("---", 2) - body = parts[2].strip() if len(parts) >= 3 else "" - else: - body = text.strip() - brief = body[:150].replace("\n", " ") - if brief: - briefs.append(f"- {p.stem}: {brief}") - return "\n".join(briefs) or "(none yet)" -``` - -- [ ] **Step 4: Run test to verify it passes** - -Run: `pytest tests/test_compiler.py::TestReadConceptBriefs -v` -Expected: All 6 tests PASS - -- [ ] **Step 5: Update the import in test file** - -Add `_read_concept_briefs` to the existing import block at the top of `tests/test_compiler.py`: - -```python -from openkb.agent.compiler import ( - compile_long_doc, - compile_short_doc, - _parse_json, - _write_summary, - _write_concept, - _update_index, - _read_wiki_context, - _read_concept_briefs, -) -``` - -- [ ] **Step 6: Commit** - -```bash -git add openkb/agent/compiler.py tests/test_compiler.py -git commit -m "feat: add _read_concept_briefs for concept dedup context" -``` - ---- - -### Task 2: Replace prompt template and update JSON parsing - -**Files:** -- Modify: `openkb/agent/compiler.py:53-70` (prompt templates section) -- Modify: `tests/test_compiler.py:21-31` (TestParseJson section) - -- [ ] **Step 1: Write the failing test for new JSON format** - -Add to `tests/test_compiler.py`: - -```python -class TestParseConceptsPlan: - def test_dict_format(self): - text = json.dumps({ - "create": [{"name": "foo", "title": "Foo"}], - "update": [{"name": "bar", "title": "Bar"}], - "related": ["baz"], - }) - parsed = _parse_json(text) - assert isinstance(parsed, dict) - assert len(parsed["create"]) == 1 - assert len(parsed["update"]) == 1 - assert parsed["related"] == ["baz"] - - def test_fallback_list_format(self): - """If LLM returns old flat array, _parse_json still 
works.""" - text = json.dumps([{"name": "foo", "title": "Foo"}]) - parsed = _parse_json(text) - assert isinstance(parsed, list) - - def test_fenced_dict(self): - text = '```json\n{"create": [], "update": [], "related": []}\n```' - parsed = _parse_json(text) - assert isinstance(parsed, dict) - assert parsed["create"] == [] -``` - -- [ ] **Step 2: Run test to verify it passes (these use existing `_parse_json`)** - -Run: `pytest tests/test_compiler.py::TestParseConceptsPlan -v` -Expected: All 3 PASS — `_parse_json` already handles dicts. This confirms compatibility. - -- [ ] **Step 3: Replace `_CONCEPTS_LIST_USER` with `_CONCEPTS_PLAN_USER`** - -In `openkb/agent/compiler.py`, replace the `_CONCEPTS_LIST_USER` template (lines 53-70) with: - -```python -_CONCEPTS_PLAN_USER = """\ -Based on the summary above, decide how to update the wiki's concept pages. - -Existing concept pages: -{concept_briefs} - -Return a JSON object with three keys: - -1. "create" — new concepts not covered by any existing page. Array of objects: - {{"name": "concept-slug", "title": "Human-Readable Title"}} - -2. "update" — existing concepts that have significant new information from \ -this document worth integrating. Array of objects: - {{"name": "existing-slug", "title": "Existing Title"}} - -3. "related" — existing concepts tangentially related to this document but \ -not needing content changes, just a cross-reference link. Array of slug strings. - -Rules: -- For the first few documents, create 2-3 foundational concepts at most. -- Do NOT create a concept that overlaps with an existing one — use "update". -- Do NOT create concepts that are just the document topic itself. -- "related" is for lightweight cross-linking only, no content rewrite needed. - -Return ONLY valid JSON, no fences, no explanation. 
-""" -``` - -- [ ] **Step 4: Add `_CONCEPT_UPDATE_USER` template** - -Add after `_CONCEPT_PAGE_USER` (after line 82): - -```python -_CONCEPT_UPDATE_USER = """\ -Update the concept page for: {title} - -Current content of this page: -{existing_content} - -New information from document "{doc_name}" (summarized above) should be \ -integrated into this page. Rewrite the full page incorporating the new \ -information naturally — do not just append. Maintain existing \ -[[wikilinks]] and add new ones where appropriate. - -Return ONLY the Markdown content (no frontmatter, no code fences). -""" -``` - -- [ ] **Step 5: Run all existing tests to verify nothing breaks** - -Run: `pytest tests/test_compiler.py -v` -Expected: All PASS (templates aren't tested directly, only via integration tests which we'll update later) - -- [ ] **Step 6: Commit** - -```bash -git add openkb/agent/compiler.py tests/test_compiler.py -git commit -m "feat: add concepts plan and update prompt templates" -``` - ---- - -### Task 3: Add `_add_related_link` and test - -**Files:** -- Modify: `openkb/agent/compiler.py` (File I/O helpers section, after `_write_concept`) -- Modify: `tests/test_compiler.py` - -- [ ] **Step 1: Write the failing test** - -Add to `tests/test_compiler.py`: - -```python -from openkb.agent.compiler import _add_related_link - -class TestAddRelatedLink: - def test_adds_see_also_link(self, tmp_path): - wiki = tmp_path / "wiki" - concepts = wiki / "concepts" - concepts.mkdir(parents=True) - (concepts / "attention.md").write_text( - "---\nsources: [paper1.pdf]\n---\n\n# Attention\n\nSome content.", - encoding="utf-8", - ) - _add_related_link(wiki, "attention", "new-doc", "paper2.pdf") - text = (concepts / "attention.md").read_text() - assert "[[summaries/new-doc]]" in text - assert "paper2.pdf" in text - - def test_skips_if_already_linked(self, tmp_path): - wiki = tmp_path / "wiki" - concepts = wiki / "concepts" - concepts.mkdir(parents=True) - (concepts / "attention.md").write_text( - 
"---\nsources: [paper1.pdf]\n---\n\n# Attention\n\nSee also: [[summaries/new-doc]]", - encoding="utf-8", - ) - _add_related_link(wiki, "attention", "new-doc", "paper1.pdf") - text = (concepts / "attention.md").read_text() - # Should not duplicate - assert text.count("[[summaries/new-doc]]") == 1 - - def test_skips_if_file_missing(self, tmp_path): - wiki = tmp_path / "wiki" - wiki.mkdir() - # Should not raise - _add_related_link(wiki, "nonexistent", "doc", "file.pdf") -``` - -- [ ] **Step 2: Run test to verify it fails** - -Run: `pytest tests/test_compiler.py::TestAddRelatedLink -v` -Expected: FAIL with `ImportError: cannot import name '_add_related_link'` - -- [ ] **Step 3: Implement `_add_related_link`** - -Add to `openkb/agent/compiler.py` after `_write_concept`: - -```python -def _add_related_link(wiki_dir: Path, concept_slug: str, doc_name: str, source_file: str) -> None: - """Add a cross-reference link to an existing concept page (no LLM call).""" - concepts_dir = wiki_dir / "concepts" - path = concepts_dir / f"{concept_slug}.md" - if not path.exists(): - return - - text = path.read_text(encoding="utf-8") - link = f"[[summaries/{doc_name}]]" - if link in text: - return - - # Update sources in frontmatter - if source_file not in text: - if text.startswith("---"): - end = text.index("---", 3) - fm = text[:end + 3] - body = text[end + 3:] - if "sources:" in fm: - fm = fm.replace("sources: [", f"sources: [{source_file}, ") - else: - fm = fm.replace("---\n", f"---\nsources: [{source_file}]\n", 1) - text = fm + body - else: - text = f"---\nsources: [{source_file}]\n---\n\n" + text - - text += f"\n\nSee also: {link}" - path.write_text(text, encoding="utf-8") -``` - -- [ ] **Step 4: Run test to verify it passes** - -Run: `pytest tests/test_compiler.py::TestAddRelatedLink -v` -Expected: All 3 tests PASS - -- [ ] **Step 5: Update the import in test file** - -Add `_add_related_link` to the import block at top of `tests/test_compiler.py`. 
- -- [ ] **Step 6: Commit** - -```bash -git add openkb/agent/compiler.py tests/test_compiler.py -git commit -m "feat: add _add_related_link for code-only cross-referencing" -``` - ---- - -### Task 4: Extract `_compile_concepts` and refactor both public functions - -**Files:** -- Modify: `openkb/agent/compiler.py:290-509` (Public API section — full rewrite) -- Modify: `tests/test_compiler.py:153-267` (integration tests) - -This is the core task. It extracts the shared Steps 2-4 into `_compile_concepts`, updates both public functions to call it, and switches to the new concepts plan format. - -- [ ] **Step 1: Write integration test for new create/update/related flow** - -Add to `tests/test_compiler.py`: - -```python -class TestCompileConceptsPlan: - """Integration tests for the new create/update/related flow.""" - - @pytest.mark.asyncio - async def test_create_and_update_flow(self, tmp_path): - """New doc creates one concept and updates an existing one.""" - wiki = tmp_path / "wiki" - (wiki / "sources").mkdir(parents=True) - (wiki / "summaries").mkdir(parents=True) - concepts_dir = wiki / "concepts" - concepts_dir.mkdir(parents=True) - (wiki / "index.md").write_text( - "# Index\n\n## Documents\n\n## Concepts\n\n## Explorations\n", - encoding="utf-8", - ) - # Pre-existing concept - (concepts_dir / "attention.md").write_text( - "---\nsources: [old-paper.pdf]\n---\n\n# Attention\n\nOld content about attention.", - encoding="utf-8", - ) - - source_path = wiki / "sources" / "new-paper.md" - source_path.write_text("# New Paper\n\nContent about flash attention and transformers.", encoding="utf-8") - (tmp_path / ".openkb").mkdir() - (tmp_path / "raw").mkdir() - (tmp_path / "raw" / "new-paper.pdf").write_bytes(b"fake") - - summary_resp = "This paper introduces flash attention, improving on attention mechanisms." 
- plan_resp = json.dumps({ - "create": [{"name": "flash-attention", "title": "Flash Attention"}], - "update": [{"name": "attention", "title": "Attention Mechanism"}], - "related": [], - }) - create_page_resp = "# Flash Attention\n\nAn efficient attention algorithm." - update_page_resp = "# Attention\n\nUpdated content with flash attention details." - - with patch("openkb.agent.compiler.litellm") as mock_litellm: - mock_litellm.completion = MagicMock( - side_effect=_mock_completion([summary_resp, plan_resp]) - ) - mock_litellm.acompletion = AsyncMock( - side_effect=_mock_acompletion([create_page_resp, update_page_resp]) - ) - await compile_short_doc("new-paper", source_path, tmp_path, "gpt-4o-mini") - - # New concept created - flash_path = concepts_dir / "flash-attention.md" - assert flash_path.exists() - assert "sources: [new-paper.pdf]" in flash_path.read_text() - - # Existing concept rewritten (not appended) - attn_text = (concepts_dir / "attention.md").read_text() - assert "new-paper.pdf" in attn_text - assert "Updated content with flash attention details" in attn_text - - # Index updated for both - index_text = (wiki / "index.md").read_text() - assert "[[concepts/flash-attention]]" in index_text - - @pytest.mark.asyncio - async def test_related_adds_link_no_llm(self, tmp_path): - """Related concepts get cross-ref links without LLM calls.""" - wiki = tmp_path / "wiki" - (wiki / "sources").mkdir(parents=True) - (wiki / "summaries").mkdir(parents=True) - concepts_dir = wiki / "concepts" - concepts_dir.mkdir(parents=True) - (wiki / "index.md").write_text( - "# Index\n\n## Documents\n\n## Concepts\n\n## Explorations\n", - encoding="utf-8", - ) - (concepts_dir / "transformer.md").write_text( - "---\nsources: [old.pdf]\n---\n\n# Transformer\n\nArchitecture details.", - encoding="utf-8", - ) - - source_path = wiki / "sources" / "doc.md" - source_path.write_text("Content", encoding="utf-8") - (tmp_path / ".openkb").mkdir() - (tmp_path / "raw").mkdir() - (tmp_path / 
"raw" / "doc.pdf").write_bytes(b"fake") - - summary_resp = "A short summary." - plan_resp = json.dumps({ - "create": [], - "update": [], - "related": ["transformer"], - }) - - with patch("openkb.agent.compiler.litellm") as mock_litellm: - mock_litellm.completion = MagicMock( - side_effect=_mock_completion([summary_resp, plan_resp]) - ) - # acompletion should NOT be called (no create/update) - mock_litellm.acompletion = AsyncMock(side_effect=AssertionError("should not be called")) - await compile_short_doc("doc", source_path, tmp_path, "gpt-4o-mini") - - # Related concept should have cross-ref link - transformer_text = (concepts_dir / "transformer.md").read_text() - assert "[[summaries/doc]]" in transformer_text - - @pytest.mark.asyncio - async def test_fallback_list_format(self, tmp_path): - """If LLM returns old flat array, treat all as create.""" - wiki = tmp_path / "wiki" - (wiki / "sources").mkdir(parents=True) - (wiki / "summaries").mkdir(parents=True) - (wiki / "concepts").mkdir(parents=True) - (wiki / "index.md").write_text( - "# Index\n\n## Documents\n\n## Concepts\n\n## Explorations\n", - encoding="utf-8", - ) - source_path = wiki / "sources" / "doc.md" - source_path.write_text("Content", encoding="utf-8") - (tmp_path / ".openkb").mkdir() - (tmp_path / "raw").mkdir() - (tmp_path / "raw" / "doc.pdf").write_bytes(b"fake") - - summary_resp = "Summary." - # Old format: flat array - plan_resp = json.dumps([{"name": "foo", "title": "Foo"}]) - page_resp = "# Foo\n\nContent." 
- - with patch("openkb.agent.compiler.litellm") as mock_litellm: - mock_litellm.completion = MagicMock( - side_effect=_mock_completion([summary_resp, plan_resp]) - ) - mock_litellm.acompletion = AsyncMock( - side_effect=_mock_acompletion([page_resp]) - ) - await compile_short_doc("doc", source_path, tmp_path, "gpt-4o-mini") - - assert (wiki / "concepts" / "foo.md").exists() -``` - -- [ ] **Step 2: Run the new tests to verify they fail** - -Run: `pytest tests/test_compiler.py::TestCompileConceptsPlan -v` -Expected: FAIL — the current code uses old prompt format and doesn't handle dict responses - -- [ ] **Step 3: Implement `_compile_concepts` and refactor public functions** - -Replace the entire Public API section (from `DEFAULT_COMPILE_CONCURRENCY` to end of file) in `openkb/agent/compiler.py` with: - -```python -DEFAULT_COMPILE_CONCURRENCY = 5 - - -async def _compile_concepts( - wiki_dir: Path, - kb_dir: Path, - model: str, - system_msg: dict, - doc_msg: dict, - summary: str, - doc_name: str, - max_concurrency: int = DEFAULT_COMPILE_CONCURRENCY, -) -> None: - """Shared concept compilation logic: plan → create/update/related → index. - - This is the core of the compilation pipeline, shared by both - compile_short_doc and compile_long_doc. 
- """ - source_file = _find_source_filename(doc_name, kb_dir) - concept_briefs = _read_concept_briefs(wiki_dir) - - # --- Concepts plan (A cached) --- - plan_raw = _llm_call(model, [ - system_msg, - doc_msg, - {"role": "assistant", "content": summary}, - {"role": "user", "content": _CONCEPTS_PLAN_USER.format( - concept_briefs=concept_briefs, - )}, - ], "concepts-plan", max_tokens=1024) - - try: - parsed = _parse_json(plan_raw) - except (json.JSONDecodeError, ValueError) as exc: - logger.warning("Failed to parse concepts plan: %s", exc) - logger.debug("Raw: %s", plan_raw) - _update_index(wiki_dir, doc_name, []) - return - - # Fallback: if LLM returns flat array, treat all as create - if isinstance(parsed, list): - create_list, update_list, related_list = parsed, [], [] - else: - create_list = parsed.get("create", []) - update_list = parsed.get("update", []) - related_list = parsed.get("related", []) - - if not create_list and not update_list and not related_list: - _update_index(wiki_dir, doc_name, []) - return - - # --- Concurrent concept generation (A cached) --- - semaphore = asyncio.Semaphore(max_concurrency) - - async def _gen_create(concept: dict) -> tuple[str, str, bool]: - name = concept["name"] - title = concept.get("title", name) - async with semaphore: - page_content = await _llm_call_async(model, [ - system_msg, - doc_msg, - {"role": "assistant", "content": summary}, - {"role": "user", "content": _CONCEPT_PAGE_USER.format( - title=title, doc_name=doc_name, - update_instruction="", - )}, - ], f"create:{name}") - return name, page_content, False - - async def _gen_update(concept: dict) -> tuple[str, str, bool]: - name = concept["name"] - title = concept.get("title", name) - # Read existing page content for the LLM to integrate - concept_path = wiki_dir / "concepts" / f"{name}.md" - if concept_path.exists(): - raw_text = concept_path.read_text(encoding="utf-8") - # Strip frontmatter for the LLM - if raw_text.startswith("---"): - parts = 
raw_text.split("---", 2) - existing_content = parts[2].strip() if len(parts) >= 3 else raw_text - else: - existing_content = raw_text - else: - existing_content = "(page not found — create from scratch)" - async with semaphore: - page_content = await _llm_call_async(model, [ - system_msg, - doc_msg, - {"role": "assistant", "content": summary}, - {"role": "user", "content": _CONCEPT_UPDATE_USER.format( - title=title, doc_name=doc_name, - existing_content=existing_content, - )}, - ], f"update:{name}") - return name, page_content, True - - tasks = [] - tasks.extend(_gen_create(c) for c in create_list) - tasks.extend(_gen_update(c) for c in update_list) - - if tasks: - total = len(tasks) - sys.stdout.write(f" Generating {total} concept(s) (concurrency={max_concurrency})...\n") - sys.stdout.flush() - - results = await asyncio.gather(*tasks, return_exceptions=True) - else: - results = [] - - concept_names = [] - for r in results: - if isinstance(r, Exception): - logger.warning("Concept generation failed: %s", r) - continue - name, page_content, is_update = r - _write_concept(wiki_dir, name, page_content, source_file, is_update) - concept_names.append(name) - - # --- Related: code-only cross-ref links --- - for slug in related_list: - _add_related_link(wiki_dir, slug, doc_name, source_file) - - # --- Update index --- - _update_index(wiki_dir, doc_name, concept_names) - - -async def compile_short_doc( - doc_name: str, - source_path: Path, - kb_dir: Path, - model: str, - max_concurrency: int = DEFAULT_COMPILE_CONCURRENCY, -) -> None: - """Compile a short document into wiki pages. - - Step 1: Generate summary from full document text. - Step 2: Plan + generate/update concept pages (via _compile_concepts). 
- """ - from openkb.config import load_config - - openkb_dir = kb_dir / ".openkb" - config = load_config(openkb_dir / "config.yaml") - language: str = config.get("language", "en") - - wiki_dir = kb_dir / "wiki" - schema_md = get_agents_md(wiki_dir) - source_file = _find_source_filename(doc_name, kb_dir) - content = source_path.read_text(encoding="utf-8") - - system_msg = {"role": "system", "content": _SYSTEM_TEMPLATE.format( - schema_md=schema_md, language=language, - )} - doc_msg = {"role": "user", "content": _SUMMARY_USER.format( - doc_name=doc_name, content=content, - )} - - # Step 1: Generate summary - summary = _llm_call(model, [system_msg, doc_msg], "summary") - _write_summary(wiki_dir, doc_name, source_file, summary) - - # Step 2: Compile concepts - await _compile_concepts( - wiki_dir, kb_dir, model, system_msg, doc_msg, summary, - doc_name, max_concurrency, - ) - - -async def compile_long_doc( - doc_name: str, - summary_path: Path, - doc_id: str, - kb_dir: Path, - model: str, - max_concurrency: int = DEFAULT_COMPILE_CONCURRENCY, -) -> None: - """Compile a long (PageIndex) document into wiki concept pages. - - The summary page is already written by the indexer. This function - generates an overview, then plans + generates/updates concept pages. 
- """ - from openkb.config import load_config - - openkb_dir = kb_dir / ".openkb" - config = load_config(openkb_dir / "config.yaml") - language: str = config.get("language", "en") - - wiki_dir = kb_dir / "wiki" - schema_md = get_agents_md(wiki_dir) - summary_text = summary_path.read_text(encoding="utf-8") - - system_msg = {"role": "system", "content": _SYSTEM_TEMPLATE.format( - schema_md=schema_md, language=language, - )} - doc_msg = {"role": "user", "content": _LONG_DOC_SUMMARY_USER.format( - doc_name=doc_name, doc_id=doc_id, content=summary_text, - )} - - # Step 1: Generate overview - overview = _llm_call(model, [system_msg, doc_msg], "overview") - - # Step 2: Compile concepts - await _compile_concepts( - wiki_dir, kb_dir, model, system_msg, doc_msg, overview, - doc_name, max_concurrency, - ) -``` - -- [ ] **Step 4: Update existing integration tests** - -Update `TestCompileShortDoc.test_full_pipeline` — the concepts-list response now needs to be the new dict format: - -```python -class TestCompileShortDoc: - @pytest.mark.asyncio - async def test_full_pipeline(self, tmp_path): - wiki = tmp_path / "wiki" - (wiki / "sources").mkdir(parents=True) - (wiki / "summaries").mkdir(parents=True) - (wiki / "concepts").mkdir(parents=True) - (wiki / "index.md").write_text( - "# Index\n\n## Documents\n\n## Concepts\n\n## Explorations\n", - encoding="utf-8", - ) - source_path = wiki / "sources" / "test-doc.md" - source_path.write_text("# Test Doc\n\nSome content about transformers.", encoding="utf-8") - (tmp_path / ".openkb").mkdir() - (tmp_path / "raw").mkdir() - (tmp_path / "raw" / "test-doc.pdf").write_bytes(b"fake") - - summary_response = "# Summary\n\nThis document discusses transformers." - plan_response = json.dumps({ - "create": [{"name": "transformer", "title": "Transformer"}], - "update": [], - "related": [], - }) - concept_page_response = "# Transformer\n\nA neural network architecture." 
- - with patch("openkb.agent.compiler.litellm") as mock_litellm: - mock_litellm.completion = MagicMock( - side_effect=_mock_completion([summary_response, plan_response]) - ) - mock_litellm.acompletion = AsyncMock( - side_effect=_mock_acompletion([concept_page_response]) - ) - await compile_short_doc("test-doc", source_path, tmp_path, "gpt-4o-mini") - - summary_path = wiki / "summaries" / "test-doc.md" - assert summary_path.exists() - assert "sources: [test-doc.pdf]" in summary_path.read_text() - - concept_path = wiki / "concepts" / "transformer.md" - assert concept_path.exists() - assert "sources: [test-doc.pdf]" in concept_path.read_text() - - index_text = (wiki / "index.md").read_text() - assert "[[summaries/test-doc]]" in index_text - assert "[[concepts/transformer]]" in index_text -``` - -Update `TestCompileShortDoc.test_handles_bad_json` — no changes needed (bad JSON still triggers fallback). - -Update `TestCompileLongDoc.test_full_pipeline`: - -```python -class TestCompileLongDoc: - @pytest.mark.asyncio - async def test_full_pipeline(self, tmp_path): - wiki = tmp_path / "wiki" - (wiki / "summaries").mkdir(parents=True) - (wiki / "concepts").mkdir(parents=True) - (wiki / "index.md").write_text( - "# Index\n\n## Documents\n\n## Concepts\n", - encoding="utf-8", - ) - summary_path = wiki / "summaries" / "big-doc.md" - summary_path.write_text("# Big Doc\n\nPageIndex summary tree.", encoding="utf-8") - openkb_dir = tmp_path / ".openkb" - openkb_dir.mkdir() - (openkb_dir / "config.yaml").write_text("model: gpt-4o-mini\n") - (tmp_path / "raw").mkdir() - (tmp_path / "raw" / "big-doc.pdf").write_bytes(b"fake") - - overview_response = "Overview of the big document." - plan_response = json.dumps({ - "create": [{"name": "deep-learning", "title": "Deep Learning"}], - "update": [], - "related": [], - }) - concept_page_response = "# Deep Learning\n\nA subfield of ML." 
- - with patch("openkb.agent.compiler.litellm") as mock_litellm: - mock_litellm.completion = MagicMock( - side_effect=_mock_completion([overview_response, plan_response]) - ) - mock_litellm.acompletion = AsyncMock( - side_effect=_mock_acompletion([concept_page_response]) - ) - await compile_long_doc( - "big-doc", summary_path, "doc-123", tmp_path, "gpt-4o-mini" - ) - - concept_path = wiki / "concepts" / "deep-learning.md" - assert concept_path.exists() - assert "Deep Learning" in concept_path.read_text() - - index_text = (wiki / "index.md").read_text() - assert "[[summaries/big-doc]]" in index_text - assert "[[concepts/deep-learning]]" in index_text -``` - -- [ ] **Step 5: Run all tests** - -Run: `pytest tests/test_compiler.py -v` -Expected: All PASS - -- [ ] **Step 6: Run the full test suite** - -Run: `pytest tests/ -v` -Expected: All 149+ tests PASS - -- [ ] **Step 7: Commit** - -```bash -git add openkb/agent/compiler.py tests/test_compiler.py -git commit -m "feat: concept dedup with briefs, update/related paths, extract _compile_concepts" -``` - ---- - -### Task 5: Clean up old references and update module docstring - -**Files:** -- Modify: `openkb/agent/compiler.py:1-9` (module docstring) - -- [ ] **Step 1: Update module docstring** - -Replace the docstring at the top of `openkb/agent/compiler.py`: - -```python -"""Wiki compilation pipeline for OpenKB. - -Pipeline leveraging LLM prompt caching: - Step 1: Build base context A (schema + document content). - Step 2: A → generate summary. - Step 3: A + summary → concepts plan (create/update/related). - Step 4: Concurrent LLM calls (A cached) → generate new + rewrite updated concepts. - Step 5: Code adds cross-ref links to related concepts, updates index. 
-""" -``` - -- [ ] **Step 2: Verify `_CONCEPTS_LIST_USER` is fully removed** - -Search for any remaining references to `_CONCEPTS_LIST_USER` in the codebase: - -Run: `grep -r "_CONCEPTS_LIST_USER" openkb/ tests/` -Expected: No matches - -- [ ] **Step 3: Run full test suite one final time** - -Run: `pytest tests/ -q` -Expected: All tests pass - -- [ ] **Step 4: Commit** - -```bash -git add openkb/agent/compiler.py -git commit -m "chore: update compiler docstring for new pipeline" -``` diff --git a/docs/superpowers/plans/2026-04-09-retrieve-redesign.md b/docs/superpowers/plans/2026-04-09-retrieve-redesign.md deleted file mode 100644 index 3c659bc..0000000 --- a/docs/superpowers/plans/2026-04-09-retrieve-redesign.md +++ /dev/null @@ -1,1104 +0,0 @@ -# Retrieve Redesign Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Unify query across long/short docs, add brief summaries to index.md and frontmatter, store long doc sources as JSON with per-page access. - -**Architecture:** (1) LLM prompts return `{"brief", "content"}` JSON — briefs flow into frontmatter and index.md. (2) Indexer stores long doc pages as JSON array. (3) New `get_page_content` tool replaces `pageindex_retrieve`. (4) Query agent uses same tools for all docs. 
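Point (1) of this architecture, where LLM replies carry both a brief and the page content, can be sketched as below. This is a minimal illustration under stated assumptions: `split_brief_and_content` and `render_page` are hypothetical names, the `brief:` frontmatter key is an assumed layout, and falling back to "whole reply is the content" when the reply is not the expected JSON is an assumption rather than behavior this plan specifies.

```python
import json


def split_brief_and_content(raw: str) -> tuple[str, str]:
    """Return (brief, content) from a reply expected to be {"brief", "content"} JSON."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Assumed fallback: keep the reply as page content with no brief.
        return "", raw.strip()
    if not isinstance(parsed, dict):
        return "", raw.strip()
    brief = str(parsed.get("brief", "")).strip()
    content = str(parsed.get("content", "")).strip()
    return brief, content


def render_page(source_file: str, brief: str, content: str) -> str:
    """Build a page with frontmatter; the 'brief' key is a hypothetical layout."""
    lines = ["---", f"sources: [{source_file}]"]
    if brief:
        # json.dumps quotes the brief so colons or quotes inside it stay valid YAML.
        lines.append(f"brief: {json.dumps(brief)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + content + "\n"


brief, content = split_brief_and_content(
    '{"brief": "Short.", "content": "# Title\\n\\nBody text."}'
)
page = render_page("paper.pdf", brief, content)
```

The same brief string could then be reused for the index.md entry, so the one-line description is generated once and stored in both places.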
- -**Tech Stack:** Python, litellm, asyncio, pytest - ---- - -### Task 1: Add `get_page_content` tool and `parse_pages` helper - -**Files:** -- Modify: `openkb/agent/tools.py` -- Modify: `tests/test_agent_tools.py` - -- [ ] **Step 1: Write failing tests** - -Add to `tests/test_agent_tools.py`: - -```python -from openkb.agent.tools import get_page_content, parse_pages - -class TestParsePages: - def test_single_page(self): - assert parse_pages("3") == [3] - - def test_range(self): - assert parse_pages("3-5") == [3, 4, 5] - - def test_comma_separated(self): - assert parse_pages("1,3,5") == [1, 3, 5] - - def test_mixed(self): - assert parse_pages("1-3,7,10-12") == [1, 2, 3, 7, 10, 11, 12] - - def test_deduplication(self): - assert parse_pages("3,3,3") == [3] - - def test_sorted(self): - assert parse_pages("5,1,3") == [1, 3, 5] - - def test_ignores_zero_and_negative(self): - assert parse_pages("0,-1,3") == [3] - - -class TestGetPageContent: - def test_reads_pages_from_json(self, tmp_path): - import json - wiki_root = str(tmp_path) - sources = tmp_path / "sources" - sources.mkdir() - pages = [ - {"page": 1, "content": "Page one text."}, - {"page": 2, "content": "Page two text."}, - {"page": 3, "content": "Page three text."}, - ] - (sources / "paper.json").write_text(json.dumps(pages), encoding="utf-8") - - result = get_page_content("paper", "1,3", wiki_root) - assert "[Page 1]" in result - assert "Page one text." in result - assert "[Page 3]" in result - assert "Page three text." 
in result - assert "Page two" not in result - - def test_returns_error_for_missing_file(self, tmp_path): - wiki_root = str(tmp_path) - (tmp_path / "sources").mkdir() - result = get_page_content("nonexistent", "1", wiki_root) - assert "not found" in result.lower() - - def test_returns_error_for_no_matching_pages(self, tmp_path): - import json - wiki_root = str(tmp_path) - sources = tmp_path / "sources" - sources.mkdir() - pages = [{"page": 1, "content": "Only page."}] - (sources / "paper.json").write_text(json.dumps(pages), encoding="utf-8") - - result = get_page_content("paper", "99", wiki_root) - assert "no content" in result.lower() or result.strip() == "" - - def test_includes_images_info(self, tmp_path): - import json - wiki_root = str(tmp_path) - sources = tmp_path / "sources" - sources.mkdir() - pages = [ - {"page": 1, "content": "Text.", "images": [{"path": "images/p/img.png", "width": 100, "height": 80}]}, - ] - (sources / "doc.json").write_text(json.dumps(pages), encoding="utf-8") - - result = get_page_content("doc", "1", wiki_root) - assert "img.png" in result - - def test_path_escape_denied(self, tmp_path): - wiki_root = str(tmp_path) - (tmp_path / "sources").mkdir() - result = get_page_content("../../etc/passwd", "1", wiki_root) - assert "denied" in result.lower() or "not found" in result.lower() -``` - -- [ ] **Step 2: Run tests to verify they fail** - -Run: `pytest tests/test_agent_tools.py::TestParsePages tests/test_agent_tools.py::TestGetPageContent -v` -Expected: FAIL with `ImportError` - -- [ ] **Step 3: Implement `parse_pages` and `get_page_content`** - -Add to `openkb/agent/tools.py`: - -```python -import json as _json - - -def parse_pages(pages: str) -> list[int]: - """Parse a page specification like '3-5,7,10-12' into a sorted list of ints.""" - result: set[int] = set() - for part in pages.split(","): - part = part.strip() - if "-" in part: - start_str, end_str = part.split("-", 1) - try: - start, end = int(start_str), int(end_str) - 
result.update(range(start, end + 1)) - except ValueError: - continue - else: - try: - result.add(int(part)) - except ValueError: - continue - return sorted(n for n in result if n >= 1) - - -def get_page_content(doc_name: str, pages: str, wiki_root: str) -> str: - """Get text content of specific pages from a long document. - - Reads from ``wiki/sources/{doc_name}.json`` which contains a JSON array - of ``{"page": int, "content": str, "images": [...]}`` objects. - - Args: - doc_name: Document name (stem, e.g. ``'attention-is-all-you-need'``). - pages: Page specification (e.g. ``'3-5,7,10-12'``). - wiki_root: Absolute path to the wiki root directory. - - Returns: - Formatted text of requested pages, or error message if not found. - """ - root = Path(wiki_root).resolve() - json_path = (root / "sources" / f"{doc_name}.json").resolve() - if not json_path.is_relative_to(root): - return "Access denied: path escapes wiki root." - if not json_path.exists(): - return f"Document not found: {doc_name}. No sources/{doc_name}.json file." 
- - data = _json.loads(json_path.read_text(encoding="utf-8")) - page_nums = set(parse_pages(pages)) - matched = [p for p in data if p["page"] in page_nums] - - if not matched: - return f"No content found for pages: {pages}" - - parts: list[str] = [] - for p in matched: - header = f"[Page {p['page']}]" - text = p.get("content", "") - if "images" in p: - img_refs = ", ".join(img["path"] for img in p["images"]) - text += f"\n[Images: {img_refs}]" - parts.append(f"{header}\n{text}") - - return "\n\n".join(parts) -``` - -- [ ] **Step 4: Run tests to verify they pass** - -Run: `pytest tests/test_agent_tools.py -v` -Expected: All PASS - -- [ ] **Step 5: Commit** - -```bash -git add openkb/agent/tools.py tests/test_agent_tools.py -git commit -m "feat: add get_page_content tool and parse_pages helper" -``` - ---- - -### Task 2: Change LLM prompts to return `{"brief", "content"}` JSON - -**Files:** -- Modify: `openkb/agent/compiler.py` (prompt templates, lines 40-105) -- Modify: `tests/test_compiler.py` (TestParseConceptsPlan) - -- [ ] **Step 1: Write test for brief+content JSON parsing** - -Add to `tests/test_compiler.py`: - -```python -class TestParseBriefContent: - def test_dict_with_brief_and_content(self): - text = json.dumps({"brief": "A short desc", "content": "# Full page\n\nDetails."}) - parsed = _parse_json(text) - assert parsed["brief"] == "A short desc" - assert "# Full page" in parsed["content"] - - def test_plain_text_fallback(self): - """If LLM returns plain text, _parse_json raises — caller handles fallback.""" - with pytest.raises((json.JSONDecodeError, ValueError)): - _parse_json("Just plain markdown text without JSON") -``` - -- [ ] **Step 2: Run test to verify it passes (existing _parse_json handles dicts)** - -Run: `pytest tests/test_compiler.py::TestParseBriefContent -v` -Expected: PASS — `_parse_json` already handles dicts - -- [ ] **Step 3: Update `_SUMMARY_USER` prompt** - -Replace in `openkb/agent/compiler.py`: - -```python -_SUMMARY_USER = """\ 
-New document: {doc_name} - -Full text: -{content} - -Write a summary page for this document in Markdown. - -Return a JSON object with two keys: -- "brief": A single sentence (under 100 chars) describing the document's main contribution -- "content": The full summary in Markdown. Include key concepts, findings, ideas, \ -and [[wikilinks]] to concepts that could become cross-document concept pages - -Return ONLY valid JSON, no fences. -""" -``` - -- [ ] **Step 4: Update `_CONCEPT_PAGE_USER` prompt** - -Replace in `openkb/agent/compiler.py`: - -```python -_CONCEPT_PAGE_USER = """\ -Write the concept page for: {title} - -This concept relates to the document "{doc_name}" summarized above. -{update_instruction} - -Return a JSON object with two keys: -- "brief": A single sentence (under 100 chars) defining this concept -- "content": The full concept page in Markdown. Include clear explanation, \ -key details from the source document, and [[wikilinks]] to related concepts \ -and [[summaries/{doc_name}]] - -Return ONLY valid JSON, no fences. -""" -``` - -- [ ] **Step 5: Update `_CONCEPT_UPDATE_USER` prompt** - -Replace in `openkb/agent/compiler.py`: - -```python -_CONCEPT_UPDATE_USER = """\ -Update the concept page for: {title} - -Current content of this page: -{existing_content} - -New information from document "{doc_name}" (summarized above) should be \ -integrated into this page. Rewrite the full page incorporating the new \ -information naturally — do not just append. Maintain existing \ -[[wikilinks]] and add new ones where appropriate. - -Return a JSON object with two keys: -- "brief": A single sentence (under 100 chars) defining this concept (may differ from before) -- "content": The rewritten full concept page in Markdown - -Return ONLY valid JSON, no fences. 
-""" -``` - -- [ ] **Step 6: Run all tests (prompts aren't tested directly)** - -Run: `pytest tests/test_compiler.py -v` -Expected: All PASS - -- [ ] **Step 7: Commit** - -```bash -git add openkb/agent/compiler.py tests/test_compiler.py -git commit -m "feat: update LLM prompts to return brief+content JSON" -``` - ---- - -### Task 3: Update `_write_summary` and `_write_concept` to store `brief` in frontmatter - -**Files:** -- Modify: `openkb/agent/compiler.py` (lines 274-320, `_write_summary` and `_write_concept`) -- Modify: `tests/test_compiler.py` - -- [ ] **Step 1: Write failing tests** - -Update existing and add new tests in `tests/test_compiler.py`: - -```python -class TestWriteSummary: - def test_writes_with_frontmatter(self, tmp_path): - wiki = tmp_path / "wiki" - wiki.mkdir() - _write_summary(wiki, "my-doc", "my-doc.pdf", "# Summary\n\nContent here.", brief="Introduces transformers") - path = wiki / "summaries" / "my-doc.md" - assert path.exists() - text = path.read_text() - assert "sources: [my-doc.pdf]" in text - assert "brief: Introduces transformers" in text - assert "# Summary" in text - - def test_writes_without_brief(self, tmp_path): - wiki = tmp_path / "wiki" - wiki.mkdir() - _write_summary(wiki, "my-doc", "my-doc.pdf", "# Summary\n\nContent here.") - path = wiki / "summaries" / "my-doc.md" - text = path.read_text() - assert "sources: [my-doc.pdf]" in text - assert "brief:" not in text -``` - -Update `TestWriteConcept`: - -```python -class TestWriteConcept: - def test_new_concept_with_brief(self, tmp_path): - wiki = tmp_path / "wiki" - wiki.mkdir() - _write_concept(wiki, "attention", "# Attention\n\nDetails.", "paper.pdf", False, brief="Mechanism for selective focus") - path = wiki / "concepts" / "attention.md" - assert path.exists() - text = path.read_text() - assert "sources: [paper.pdf]" in text - assert "brief: Mechanism for selective focus" in text - assert "# Attention" in text - - def test_new_concept(self, tmp_path): - wiki = tmp_path / 
"wiki" - wiki.mkdir() - _write_concept(wiki, "attention", "# Attention\n\nDetails.", "paper.pdf", False) - path = wiki / "concepts" / "attention.md" - assert path.exists() - text = path.read_text() - assert "sources: [paper.pdf]" in text - assert "# Attention" in text - - def test_update_concept_appends_source(self, tmp_path): - wiki = tmp_path / "wiki" - concepts = wiki / "concepts" - concepts.mkdir(parents=True) - (concepts / "attention.md").write_text( - "---\nsources: [paper1.pdf]\nbrief: Old brief\n---\n\n# Attention\n\nOld content.", - encoding="utf-8", - ) - _write_concept(wiki, "attention", "New info from paper2.", "paper2.pdf", True, brief="Updated brief") - text = (concepts / "attention.md").read_text() - assert "paper2.pdf" in text - assert "paper1.pdf" in text - assert "brief: Updated brief" in text - assert "New info from paper2." in text -``` - -- [ ] **Step 2: Run tests to verify they fail** - -Run: `pytest tests/test_compiler.py::TestWriteSummary tests/test_compiler.py::TestWriteConcept -v` -Expected: FAIL — `_write_summary` and `_write_concept` don't accept `brief` parameter - -- [ ] **Step 3: Update `_write_summary` to accept `brief`** - -```python -def _write_summary(wiki_dir: Path, doc_name: str, source_file: str, summary: str, brief: str = "") -> None: - """Write summary page with frontmatter.""" - summaries_dir = wiki_dir / "summaries" - summaries_dir.mkdir(parents=True, exist_ok=True) - fm_lines = [f"sources: [{source_file}]"] - if brief: - fm_lines.append(f"brief: {brief}") - frontmatter = "---\n" + "\n".join(fm_lines) + "\n---\n\n" - (summaries_dir / f"{doc_name}.md").write_text(frontmatter + summary, encoding="utf-8") -``` - -- [ ] **Step 4: Update `_write_concept` to accept `brief`** - -Add `brief: str = ""` parameter to `_write_concept`. 
In the new-concept branch: - -```python - else: - fm_lines = [f"sources: [{source_file}]"] - if brief: - fm_lines.append(f"brief: {brief}") - frontmatter = "---\n" + "\n".join(fm_lines) + "\n---\n\n" - path.write_text(frontmatter + content, encoding="utf-8") -``` - -In the update branch, after updating sources in frontmatter, also update brief: - -```python - if is_update and path.exists(): - existing = path.read_text(encoding="utf-8") - if source_file not in existing: - # ... existing frontmatter update logic ... - # Update brief in frontmatter if provided - if brief and existing.startswith("---"): - end = existing.find("---", 3) - if end != -1: - fm = existing[:end + 3] - body = existing[end + 3:] - if "brief:" in fm: - import re - fm = re.sub(r"brief:.*", f"brief: {brief}", fm) - else: - fm = fm.replace("---\n", f"---\nbrief: {brief}\n", 1) - existing = fm + body - path.write_text(existing, encoding="utf-8") -``` - -- [ ] **Step 5: Run tests to verify they pass** - -Run: `pytest tests/test_compiler.py::TestWriteSummary tests/test_compiler.py::TestWriteConcept -v` -Expected: All PASS - -- [ ] **Step 6: Commit** - -```bash -git add openkb/agent/compiler.py tests/test_compiler.py -git commit -m "feat: store brief in frontmatter of summary and concept pages" -``` - ---- - -### Task 4: Update `_update_index` to include briefs, and update `_read_concept_briefs` to read from frontmatter - -**Files:** -- Modify: `openkb/agent/compiler.py` (lines 233-261 and 408-430) -- Modify: `tests/test_compiler.py` - -- [ ] **Step 1: Write failing tests for `_update_index` with briefs** - -```python -class TestUpdateIndex: - def test_appends_entries_with_briefs(self, tmp_path): - wiki = tmp_path / "wiki" - wiki.mkdir() - (wiki / "index.md").write_text( - "# Index\n\n## Documents\n\n## Concepts\n\n## Explorations\n", - encoding="utf-8", - ) - _update_index(wiki, "my-doc", ["attention", "transformer"], - doc_brief="Introduces transformers", - concept_briefs={"attention": "Focus 
mechanism", "transformer": "NN architecture"}) - text = (wiki / "index.md").read_text() - assert "[[summaries/my-doc]] — Introduces transformers" in text - assert "[[concepts/attention]] — Focus mechanism" in text - assert "[[concepts/transformer]] — NN architecture" in text - - def test_no_duplicates(self, tmp_path): - wiki = tmp_path / "wiki" - wiki.mkdir() - (wiki / "index.md").write_text( - "# Index\n\n## Documents\n- [[summaries/my-doc]] — Old brief\n\n## Concepts\n", - encoding="utf-8", - ) - _update_index(wiki, "my-doc", [], doc_brief="New brief") - text = (wiki / "index.md").read_text() - assert text.count("[[summaries/my-doc]]") == 1 - - def test_backwards_compat_no_briefs(self, tmp_path): - wiki = tmp_path / "wiki" - wiki.mkdir() - (wiki / "index.md").write_text( - "# Index\n\n## Documents\n\n## Concepts\n\n## Explorations\n", - encoding="utf-8", - ) - _update_index(wiki, "my-doc", ["attention"]) - text = (wiki / "index.md").read_text() - assert "[[summaries/my-doc]]" in text - assert "[[concepts/attention]]" in text -``` - -Write test for updated `_read_concept_briefs`: - -```python -class TestReadConceptBriefs: - # ... keep existing tests ... - - def test_reads_brief_from_frontmatter(self, tmp_path): - wiki = tmp_path / "wiki" - concepts = wiki / "concepts" - concepts.mkdir(parents=True) - (concepts / "attention.md").write_text( - "---\nsources: [paper.pdf]\nbrief: Selective focus mechanism\n---\n\n# Attention\n\nLong content...", - encoding="utf-8", - ) - result = _read_concept_briefs(wiki) - assert "- attention: Selective focus mechanism" in result - - def test_falls_back_to_body_truncation(self, tmp_path): - wiki = tmp_path / "wiki" - concepts = wiki / "concepts" - concepts.mkdir(parents=True) - (concepts / "old.md").write_text( - "---\nsources: [paper.pdf]\n---\n\nOld concept without brief field.", - encoding="utf-8", - ) - result = _read_concept_briefs(wiki) - assert "- old: Old concept without brief field." 
in result -``` - -- [ ] **Step 2: Run tests to verify they fail** - -Run: `pytest tests/test_compiler.py::TestUpdateIndex tests/test_compiler.py::TestReadConceptBriefs -v` -Expected: FAIL — `_update_index` doesn't accept `doc_brief`/`concept_briefs` parameters - -- [ ] **Step 3: Update `_update_index`** - -```python -def _update_index( - wiki_dir: Path, doc_name: str, concept_names: list[str], - doc_brief: str = "", concept_briefs: dict[str, str] | None = None, -) -> None: - """Append document and concept entries to index.md with optional briefs.""" - index_path = wiki_dir / "index.md" - if not index_path.exists(): - index_path.write_text( - "# Knowledge Base Index\n\n## Documents\n\n## Concepts\n\n## Explorations\n", - encoding="utf-8", - ) - - text = index_path.read_text(encoding="utf-8") - - doc_link = f"[[summaries/{doc_name}]]" - if doc_link not in text: - doc_entry = f"- {doc_link}" - if doc_brief: - doc_entry += f" — {doc_brief}" - if "## Documents" in text: - text = text.replace("## Documents\n", f"## Documents\n{doc_entry}\n", 1) - - if concept_briefs is None: - concept_briefs = {} - for name in concept_names: - concept_link = f"[[concepts/{name}]]" - if concept_link not in text: - concept_entry = f"- {concept_link}" - if name in concept_briefs: - concept_entry += f" — {concept_briefs[name]}" - if "## Concepts" in text: - text = text.replace("## Concepts\n", f"## Concepts\n{concept_entry}\n", 1) - - index_path.write_text(text, encoding="utf-8") -``` - -- [ ] **Step 4: Update `_read_concept_briefs` to read from frontmatter `brief:` field** - -```python -def _read_concept_briefs(wiki_dir: Path) -> str: - """Read existing concept pages and return compact one-line summaries. - - Reads ``brief:`` from YAML frontmatter if available, otherwise falls back - to the first 150 characters of the body text. 
- """ - concepts_dir = wiki_dir / "concepts" - if not concepts_dir.exists(): - return "(none yet)" - - md_files = sorted(concepts_dir.glob("*.md")) - if not md_files: - return "(none yet)" - - lines: list[str] = [] - for path in md_files: - text = path.read_text(encoding="utf-8") - brief = "" - body = text - if text.startswith("---"): - end = text.find("---", 3) - if end != -1: - fm = text[:end + 3] - body = text[end + 3:] - # Try to extract brief from frontmatter - for line in fm.split("\n"): - if line.startswith("brief:"): - brief = line[len("brief:"):].strip() - break - if not brief: - brief = body.strip().replace("\n", " ")[:150] - if brief: - lines.append(f"- {path.stem}: {brief}") - - return "\n".join(lines) or "(none yet)" -``` - -- [ ] **Step 5: Run tests** - -Run: `pytest tests/test_compiler.py -v` -Expected: All PASS - -- [ ] **Step 6: Commit** - -```bash -git add openkb/agent/compiler.py tests/test_compiler.py -git commit -m "feat: add briefs to index.md entries and read from frontmatter" -``` - ---- - -### Task 5: Wire briefs through `_compile_concepts` and public functions - -**Files:** -- Modify: `openkb/agent/compiler.py` (lines 438-611, `_compile_concepts`, `compile_short_doc`, `compile_long_doc`) -- Modify: `tests/test_compiler.py` - -This task connects the brief+content JSON parsing to the write functions and index update. 
- -- [ ] **Step 1: Write integration test** - -```python -class TestBriefIntegration: - @pytest.mark.asyncio - async def test_short_doc_briefs_in_index_and_frontmatter(self, tmp_path): - wiki = tmp_path / "wiki" - (wiki / "sources").mkdir(parents=True) - (wiki / "summaries").mkdir(parents=True) - (wiki / "concepts").mkdir(parents=True) - (wiki / "index.md").write_text( - "# Index\n\n## Documents\n\n## Concepts\n\n## Explorations\n", - encoding="utf-8", - ) - source_path = wiki / "sources" / "test-doc.md" - source_path.write_text("# Test Doc\n\nContent.", encoding="utf-8") - (tmp_path / ".openkb").mkdir() - (tmp_path / "raw").mkdir() - (tmp_path / "raw" / "test-doc.pdf").write_bytes(b"fake") - - summary_resp = json.dumps({ - "brief": "A paper about transformers", - "content": "# Summary\n\nThis paper discusses transformers.", - }) - plan_resp = json.dumps({ - "create": [{"name": "transformer", "title": "Transformer"}], - "update": [], - "related": [], - }) - concept_resp = json.dumps({ - "brief": "NN architecture using self-attention", - "content": "# Transformer\n\nA neural network architecture.", - }) - - with patch("openkb.agent.compiler.litellm") as mock_litellm: - mock_litellm.completion = MagicMock( - side_effect=_mock_completion([summary_resp, plan_resp]) - ) - mock_litellm.acompletion = AsyncMock( - side_effect=_mock_acompletion([concept_resp]) - ) - await compile_short_doc("test-doc", source_path, tmp_path, "gpt-4o-mini") - - # Check summary frontmatter has brief - summary_text = (wiki / "summaries" / "test-doc.md").read_text() - assert "brief: A paper about transformers" in summary_text - - # Check concept frontmatter has brief - concept_text = (wiki / "concepts" / "transformer.md").read_text() - assert "brief: NN architecture using self-attention" in concept_text - - # Check index has briefs - index_text = (wiki / "index.md").read_text() - assert "[[summaries/test-doc]] — A paper about transformers" in index_text - assert "[[concepts/transformer]] — NN 
architecture using self-attention" in index_text -``` - -- [ ] **Step 2: Run test to verify it fails** - -Run: `pytest tests/test_compiler.py::TestBriefIntegration -v` -Expected: FAIL - -- [ ] **Step 3: Update `compile_short_doc` to parse brief+content from summary response** - -In `compile_short_doc`, replace: - -```python - # --- Step 1: Generate summary --- - summary = _llm_call(model, [system_msg, doc_msg], "summary") - _write_summary(wiki_dir, doc_name, source_file, summary) -``` - -With: - -```python - # --- Step 1: Generate summary --- - summary_raw = _llm_call(model, [system_msg, doc_msg], "summary") - try: - summary_parsed = _parse_json(summary_raw) - doc_brief = summary_parsed.get("brief", "") - summary = summary_parsed.get("content", summary_raw) - except (json.JSONDecodeError, ValueError): - doc_brief = "" - summary = summary_raw - _write_summary(wiki_dir, doc_name, source_file, summary, brief=doc_brief) -``` - -- [ ] **Step 4: Update `_compile_concepts` signature and wiring** - -Add `doc_brief: str = ""` parameter to `_compile_concepts`. - -In `_gen_create`, parse the response: - -```python - async def _gen_create(concept: dict) -> tuple[str, str, bool, str]: - name = concept["name"] - title = concept.get("title", name) - async with semaphore: - raw = await _llm_call_async(model, [ - system_msg, doc_msg, - {"role": "assistant", "content": summary}, - {"role": "user", "content": _CONCEPT_PAGE_USER.format( - title=title, doc_name=doc_name, update_instruction="", - )}, - ], f"create:{name}") - try: - parsed = _parse_json(raw) - brief = parsed.get("brief", "") - content = parsed.get("content", raw) - except (json.JSONDecodeError, ValueError): - brief, content = "", raw - return name, content, False, brief -``` - -Same for `_gen_update` — returns `tuple[str, str, bool, str]` (name, content, is_update, brief). 
- -In the results processing loop: - -```python - concept_briefs_map: dict[str, str] = {} - for r in results: - if isinstance(r, Exception): - logger.warning("Concept generation failed: %s", r) - continue - name, page_content, is_update, brief = r - _write_concept(wiki_dir, name, page_content, source_file, is_update, brief=brief) - concept_names.append(name) - if brief: - concept_briefs_map[name] = brief -``` - -Pass briefs to `_update_index`: - -```python - _update_index(wiki_dir, doc_name, concept_names, - doc_brief=doc_brief, concept_briefs=concept_briefs_map) -``` - -- [ ] **Step 5: Update `compile_short_doc` to pass `doc_brief` to `_compile_concepts`** - -```python - await _compile_concepts( - wiki_dir, kb_dir, model, system_msg, doc_msg, - summary, doc_name, max_concurrency, doc_brief=doc_brief, - ) -``` - -- [ ] **Step 6: Update `compile_long_doc` to pass `doc_brief` from `IndexResult.description`** - -`compile_long_doc` currently takes `doc_id` but not `description`. Add `doc_description: str = ""` parameter: - -```python -async def compile_long_doc( - doc_name: str, - summary_path: Path, - doc_id: str, - kb_dir: Path, - model: str, - doc_description: str = "", - max_concurrency: int = DEFAULT_COMPILE_CONCURRENCY, -) -> None: -``` - -The `_LONG_DOC_SUMMARY_USER` stays unchanged (returns plain text, not JSON). 
Pass `doc_description` as `doc_brief`: - -```python - await _compile_concepts( - wiki_dir, kb_dir, model, system_msg, doc_msg, - overview, doc_name, max_concurrency, doc_brief=doc_description, - ) -``` - -Also update the CLI call in `cli.py` line 135: - -```python -asyncio.run( - compile_long_doc(doc_name, summary_path, index_result.doc_id, kb_dir, model, - doc_description=index_result.description) -) -``` - -- [ ] **Step 7: Update existing integration tests for new JSON response format** - -Update all mock LLM responses in `TestCompileShortDoc`, `TestCompileLongDoc`, and `TestCompileConceptsPlan` to return `{"brief": "...", "content": "..."}` JSON instead of plain text for summary and concept responses. - -- [ ] **Step 8: Run all tests** - -Run: `pytest tests/ -q` -Expected: All PASS - -- [ ] **Step 9: Commit** - -```bash -git add openkb/agent/compiler.py openkb/cli.py tests/test_compiler.py -git commit -m "feat: wire brief+content JSON through compile pipeline to index and frontmatter" -``` - ---- - -### Task 6: Indexer — long doc sources from markdown to JSON - -**Files:** -- Modify: `openkb/indexer.py` -- Modify: `openkb/tree_renderer.py` (remove `render_source_md`) -- Modify: `tests/test_indexer.py` - -- [ ] **Step 1: Write failing test** - -Update `tests/test_indexer.py`: - -```python - def test_source_page_written_as_json(self, kb_dir, sample_tree, tmp_path): - """Long doc source should be written as JSON, not markdown.""" - import json as json_mod - doc_id = "abc-123" - fake_col = self._make_fake_collection(doc_id, sample_tree) - - fake_client = MagicMock() - fake_client.collection.return_value = fake_col - # Mock get_page_content to return page data - fake_col.get_page_content.return_value = [ - {"page": 1, "content": "Page one text."}, - {"page": 2, "content": "Page two text."}, - ] - - pdf_path = tmp_path / "sample.pdf" - pdf_path.write_bytes(b"%PDF-1.4 fake") - - with patch("openkb.indexer.PageIndexClient", return_value=fake_client): - 
index_long_document(pdf_path, kb_dir) - - # Should be JSON, not MD - json_file = kb_dir / "wiki" / "sources" / "sample.json" - assert json_file.exists() - assert not (kb_dir / "wiki" / "sources" / "sample.md").exists() - data = json_mod.loads(json_file.read_text()) - assert len(data) == 2 - assert data[0]["page"] == 1 -``` - -- [ ] **Step 2: Run test to verify it fails** - -Run: `pytest tests/test_indexer.py::TestIndexLongDocument::test_source_page_written_as_json -v` -Expected: FAIL - -- [ ] **Step 3: Update `indexer.py` to write JSON sources** - -Replace the source writing block (lines 103-110) with: - -```python - # Write wiki/sources/ as JSON (per-page content from PageIndex) - sources_dir = kb_dir / "wiki" / "sources" - sources_dir.mkdir(parents=True, exist_ok=True) - dest_images_dir = sources_dir / "images" / pdf_path.stem - - # Get per-page content from PageIndex - all_pages = col.get_page_content(doc_id, f"1-{doc.get('page_count', 9999)}") - - # Relocate image paths - dest_images_dir.mkdir(parents=True, exist_ok=True) - for page in all_pages: - if "images" in page: - for img in page["images"]: - src_path = Path(img["path"]) - if src_path.exists(): - filename = src_path.name - dest = dest_images_dir / filename - if not dest.exists(): - shutil.copy2(src_path, dest) - img["path"] = f"images/{pdf_path.stem}/{filename}" - - import json as json_mod - (sources_dir / f"{pdf_path.stem}.json").write_text( - json_mod.dumps(all_pages, ensure_ascii=False, indent=2), encoding="utf-8", - ) -``` - -Remove the `render_source_md` import and `_relocate_images` call. - -- [ ] **Step 4: Remove `render_source_md` from tree_renderer.py** - -Remove the `render_source_md` function and `_render_nodes_source` helper from `openkb/tree_renderer.py`. Keep `render_summary_md` and `_render_nodes_summary`. - -- [ ] **Step 5: Update existing test `test_source_page_written`** - -The old test checks for `.md` — update it to check for `.json` or remove it (replaced by the new test). 
- -- [ ] **Step 6: Run all tests** - -Run: `pytest tests/ -q` -Expected: All PASS - -- [ ] **Step 7: Commit** - -```bash -git add openkb/indexer.py openkb/tree_renderer.py tests/test_indexer.py -git commit -m "feat: store long doc sources as per-page JSON, remove render_source_md" -``` - ---- - -### Task 7: Query agent — remove `pageindex_retrieve`, add `get_page_content`, update instructions - -**Files:** -- Modify: `openkb/agent/query.py` -- Modify: `openkb/schema.py` -- Modify: `tests/test_query.py` - -- [ ] **Step 1: Write failing tests** - -Update `tests/test_query.py`: - -```python -class TestBuildQueryAgent: - def test_agent_name(self, tmp_path): - agent = build_query_agent(str(tmp_path), "gpt-4o-mini") - assert agent.name == "wiki-query" - - def test_agent_has_three_tools(self, tmp_path): - agent = build_query_agent(str(tmp_path), "gpt-4o-mini") - assert len(agent.tools) == 3 - - def test_agent_tool_names(self, tmp_path): - agent = build_query_agent(str(tmp_path), "gpt-4o-mini") - names = {t.name for t in agent.tools} - assert "list_files" in names - assert "read_file" in names - assert "get_page_content" in names - assert "pageindex_retrieve" not in names - - def test_instructions_mention_get_page_content(self, tmp_path): - agent = build_query_agent(str(tmp_path), "gpt-4o-mini") - assert "get_page_content" in agent.instructions -``` - -- [ ] **Step 2: Run tests to verify they fail** - -Run: `pytest tests/test_query.py::TestBuildQueryAgent -v` -Expected: FAIL — old signature requires `openkb_dir` - -- [ ] **Step 3: Rewrite `query.py`** - -Remove `_pageindex_retrieve_impl` entirely (~110 lines). Remove `PageIndexClient` import. 
Update `build_query_agent`: - -```python -def build_query_agent(wiki_root: str, model: str, language: str = "en") -> Agent: - """Build and return the Q&A agent.""" - schema_md = get_agents_md(Path(wiki_root)) - instructions = _QUERY_INSTRUCTIONS_TEMPLATE.format(schema_md=schema_md) - instructions += f"\n\nIMPORTANT: Write all wiki content in {language} language." - - @function_tool - def list_files(directory: str) -> str: - """List all Markdown files in a wiki subdirectory.""" - return list_wiki_files(directory, wiki_root) - - @function_tool - def read_file(path: str) -> str: - """Read a Markdown file from the wiki.""" - return read_wiki_file(path, wiki_root) - - # Register the tool as "get_page_content" (the name the tests and the - # instructions use); the default tool name would be the function name. - @function_tool(name_override="get_page_content") - def get_page_content_tool(doc_name: str, pages: str) -> str: - """Get text content of specific pages from a long document. - - Args: - doc_name: Document name (e.g. 'attention-is-all-you-need'). - pages: Page specification (e.g. '3-5,7,10-12'). - """ - from openkb.agent.tools import get_page_content - return get_page_content(doc_name, pages, wiki_root) - - from agents.model_settings import ModelSettings - - return Agent( - name="wiki-query", - instructions=instructions, - tools=[list_files, read_file, get_page_content_tool], - model=f"litellm/{model}", - model_settings=ModelSettings(parallel_tool_calls=False), - ) -``` - -Update `_QUERY_INSTRUCTIONS_TEMPLATE`: - -```python -_QUERY_INSTRUCTIONS_TEMPLATE = """\ -You are a knowledge-base Q&A agent. You answer questions by searching the wiki. - -{schema_md} - -## Search strategy -1. Read index.md to understand what documents and concepts are available. - Each entry has a brief summary to help you judge relevance. -2. Read relevant summary pages (summaries/) for document overviews. -3. Read concept pages (concepts/) for cross-document synthesis. -4. For long documents, use get_page_content(doc_name, pages) to read - specific pages when you need detailed content. 
The summary page - shows chapter structure with page ranges to help you decide which - pages to read. -5. Synthesise a clear, well-cited answer. - -Always ground your answer in the wiki content. If you cannot find relevant -information, say so clearly. -""" -``` - -Update `run_query` to match new `build_query_agent` signature (remove `openkb_dir` param): - -```python -async def run_query(question: str, kb_dir: Path, model: str, stream: bool = False) -> str: - from openkb.config import load_config - openkb_dir = kb_dir / ".openkb" - config = load_config(openkb_dir / "config.yaml") - language: str = config.get("language", "en") - - wiki_root = str(kb_dir / "wiki") - agent = build_query_agent(wiki_root, model, language=language) - # ... rest unchanged ... -``` - -- [ ] **Step 4: Update `openkb/schema.py` AGENTS_MD** - -Add a note about `get_page_content` for long documents in the Schema: - -```python -## Page Types -- **Summary Page** (summaries/): Key content of a single source document. -- **Concept Page** (concepts/): Cross-document topic synthesis with [[wikilinks]]. -- **Exploration Page** (explorations/): Saved query results — analyses, comparisons, syntheses. -- **Source Page** (sources/): Full-text for short docs (.md) or per-page JSON for long docs (.json). -- **Index Page** (index.md): One-liner summary of every page in the wiki. Auto-maintained. -``` - -- [ ] **Step 5: Run all tests** - -Run: `pytest tests/ -q` -Expected: All PASS - -- [ ] **Step 6: Commit** - -```bash -git add openkb/agent/query.py openkb/schema.py tests/test_query.py -git commit -m "feat: replace pageindex_retrieve with get_page_content, unify query for all docs" -``` - ---- - -### Task 8: Final cleanup and full verification - -**Files:** -- Modify: `openkb/indexer.py` (remove unused imports) -- Verify all files - -- [ ] **Step 1: Remove unused imports** - -In `indexer.py`, remove `from openkb.tree_renderer import render_source_md` if still present (keep `render_summary_md`). 
- -In `query.py`, verify `PageIndexClient` import is removed. - -- [ ] **Step 2: Run full test suite** - -Run: `pytest tests/ -v` -Expected: All PASS - -- [ ] **Step 3: Grep for dead references** - -Run: `grep -r "pageindex_retrieve\|render_source_md\|_relocate_images" openkb/ tests/` -Expected: No matches - -- [ ] **Step 4: Commit** - -```bash -git add -A -git commit -m "chore: remove dead imports and references" -``` diff --git a/docs/superpowers/specs/2026-04-09-concept-dedup-and-update-design.md b/docs/superpowers/specs/2026-04-09-concept-dedup-and-update-design.md deleted file mode 100644 index 2fcd853..0000000 --- a/docs/superpowers/specs/2026-04-09-concept-dedup-and-update-design.md +++ /dev/null @@ -1,163 +0,0 @@ -# Concept Dedup & Existing Page Update - -**Date:** 2026-04-09 -**Status:** Approved -**Branch:** bugfix/compile - -## Problem - -The compiler pipeline generates concept pages per document, but: - -1. **No dedup** — LLM only sees concept slug names, not content. It can't reliably judge whether a new concept overlaps with an existing one. As the KB grows, concepts duplicate and diverge. -2. **No update of existing pages** — When a new document has information relevant to existing concepts, those pages are not updated. Knowledge doesn't compound across documents. - -The old agent-based approach solved this (the agent could read/write wiki files freely), but was too slow — 20-30 tool-call round-trips per document. - -## Design - -Extend the existing deterministic pipeline to give the LLM enough context for dedup/update decisions, without adding agent loops or breaking prompt caching. - -### Prompt Caching Invariant - -The cached prefix `[system_msg, doc_msg]` must remain identical across all LLM calls within a single document compilation. All new context (concept briefs, existing page content) goes into messages **after** the cached prefix. 
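A minimal sketch of the invariant, assuming messages are plain role/content dicts in the usual chat-completion shape. `build_messages` and the placeholder strings are hypothetical and only illustrate the ordering; they are not functions or prompts from `compiler.py`.

```python
def build_messages(system_msg, doc_msg, tail):
    """Return the shared prefix followed by call-specific messages.

    The prefix is always exactly [system_msg, doc_msg]. Anything that varies
    between calls (summary, concept briefs, existing page text) goes in tail.
    """
    return [system_msg, doc_msg, *tail]


system_msg = {"role": "system", "content": "You are a wiki compiler."}
doc_msg = {"role": "user", "content": "<full document text>"}

# Hypothetical concepts-plan call: the briefs live in the final user message.
plan_call = build_messages(system_msg, doc_msg, [
    {"role": "assistant", "content": "<summary>"},
    {"role": "user", "content": "Existing concepts:\n- attention: ...\n\nPlan the concepts."},
])

# Hypothetical update call: the existing page content also stays in the tail.
update_call = build_messages(system_msg, doc_msg, [
    {"role": "assistant", "content": "<summary>"},
    {"role": "user", "content": "Current content of this page:\n...\n\nRewrite the page."},
])

# Both calls begin with the same two messages, so the prefix stays cacheable.
assert plan_call[:2] == update_call[:2] == [system_msg, doc_msg]
```

Keeping every variable piece in `tail` means a change to the briefs or to an existing page never alters the first two messages.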
- -### Pipeline Overview - -``` -Step 1: [system, doc] → summary (unchanged) -Step 2: [system, doc, summary, concepts_plan_prompt] → concepts plan JSON -Step 3a: [system, doc, summary, create_prompt] × N → new concept pages (concurrent) -Step 3b: [system, doc, summary, update_prompt] × M → rewritten concept pages (concurrent) -Step 3c: code-only × K → add cross-ref links to related concepts -Step 4: update index (unchanged) -``` - -Steps 3a and 3b share a single semaphore and run concurrently together. - -### Part 1: Concept Briefs - -New function `_read_concept_briefs(wiki_dir)` reads existing concept pages and returns a compact summary string: - -``` -- attention: Attention is a mechanism that allows models to focus on relevant parts... -- transformer-architecture: The Transformer is a neural network architecture... -``` - -For each concept file in `wiki/concepts/*.md`: -- Skip YAML frontmatter -- Take first 150 characters of body text -- Format as `- {slug}: {brief}` - -This replaces the current `", ".join(existing_concepts)` in the concepts-list prompt. Pure file I/O, no LLM call. - -### Part 2: Concepts Plan Prompt - -The `_CONCEPTS_LIST_USER` template is replaced with a new `_CONCEPTS_PLAN_USER` template that asks the LLM to return a JSON object with three action types: - -```json -{ - "create": [{"name": "flash-attention", "title": "Flash Attention"}], - "update": [{"name": "attention", "title": "Attention Mechanism"}], - "related": ["transformer-architecture"] -} -``` - -- **create** — New concept not covered by any existing page. -- **update** — Existing concept with significant new information worth integrating. -- **related** — Existing concept tangentially related; only needs a cross-reference link. - -The prompt includes rules: -- Don't create concepts that overlap with existing ones — use "update" instead. -- Don't create concepts that are just the document topic itself. -- For first few documents, create 2-3 foundational concepts at most. 
-- "related" is for lightweight cross-linking only. - -### Part 3: Three Execution Paths - -#### create (unchanged) - -Same as current: concurrent `_llm_call_async` with `_CONCEPT_PAGE_USER` template. Written via `_write_concept` with `is_update=False`. - -#### update (new) - -New template `_CONCEPT_UPDATE_USER`: - -``` -Update the concept page for: {title} - -Current content of this page: -{existing_content} - -New information from document "{doc_name}" (summarized above) should be -integrated into this page. Rewrite the full page incorporating the new -information naturally. Maintain existing cross-references and add new ones -where appropriate. - -Return ONLY the Markdown content (no frontmatter, no code fences). -``` - -Call structure: `[system_msg, doc_msg, {assistant: summary}, update_user_msg]` - -The cached prefix `[system_msg, doc_msg]` is shared with create calls. The `existing_content` (typically 200-500 tokens) is in the final user message only. - -Written via `_write_concept` with `is_update=True`. The frontmatter `sources:` list is updated to include the new source file. - -#### related (code-only, no LLM) - -For each related slug: -1. Read the concept file -2. If `summaries/{doc_name}` is not already linked, append `\n\nSee also: [[summaries/{doc_name}]]` -3. Update frontmatter `sources:` list - -Pure file I/O, millisecond-level. - -### Part 4: Shared Logic Between Short and Long Doc - -Current `compile_short_doc` and `compile_long_doc` duplicate Steps 2-4. Extract shared logic into `_compile_concepts(wiki_dir, model, system_msg, doc_msg, summary, doc_name, kb_dir, max_concurrency)`. 
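The `max_concurrency` argument hints at how the create and update calls could be throttled together. Below is a minimal sketch, assuming a single `asyncio.Semaphore` shared by both paths. `run_concept_calls`, `fake_call`, and the `call` parameter are hypothetical stand-ins and do not reflect the real `_llm_call_async` signature.

```python
import asyncio


async def run_concept_calls(create_items, update_items, call, max_concurrency=4):
    """Run create and update calls concurrently under one shared limit."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(action, item):
        # Both action types wait on the same semaphore.
        async with sem:
            return action, item["name"], await call(action, item)

    tasks = [guarded("create", c) for c in create_items]
    tasks += [guarded("update", u) for u in update_items]
    # gather preserves task order: creates first, then updates.
    return await asyncio.gather(*tasks)


async def fake_call(action, item):
    # Stand-in for an LLM request; yields control once, then returns a label.
    await asyncio.sleep(0)
    return f"{action}:{item['name']}"


results = asyncio.run(run_concept_calls(
    [{"name": "flash-attention"}],
    [{"name": "attention"}],
    fake_call,
    max_concurrency=2,
))
```

Using one semaphore rather than one per action type keeps the total number of in-flight requests bounded regardless of how the plan splits between create and update.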
- -Public functions become: -- `compile_short_doc`: builds context A from source text → calls `_compile_concepts` -- `compile_long_doc`: builds context A from PageIndex summary → calls `_compile_concepts` - -### Part 5: JSON Parsing Fallback - -If the LLM returns a flat JSON array instead of the expected dict, treat it as all "create" actions: - -```python -if isinstance(parsed, list): - create_list, update_list, related_list = parsed, [], [] -else: - create_list = parsed.get("create", []) - update_list = parsed.get("update", []) - related_list = parsed.get("related", []) -``` - -This ensures backward compatibility if the LLM doesn't follow the new format. - -## Token Cost Analysis - -Compared to current pipeline (per document with C existing concepts): - -| Step | Current | New | Delta | -|------|---------|-----|-------| -| concepts-list prompt | ~50 tokens (slug names) | ~50 + C×30 tokens (briefs) | +C×30 | -| update calls | 0 | M × ~500 tokens (existing content) | +M×500 | -| related | 0 | 0 (code-only) | 0 | - -At C=30 existing concepts: +900 tokens in concepts-list prompt. -At M=2 update calls: +1000 tokens total. - -Total overhead: ~2000 tokens per document. Negligible compared to document content (5K-20K tokens). 
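The overhead estimate can be reproduced with a few lines of arithmetic. This sketch uses the per-item figures from the table above (about 30 tokens per brief and about 500 tokens per update call). `estimate_overhead` is a hypothetical helper for illustration only.

```python
def estimate_overhead(existing_concepts, update_calls,
                      tokens_per_brief=30, tokens_per_update=500):
    """Estimate extra prompt tokens per document under the new pipeline."""
    briefs = existing_concepts * tokens_per_brief   # +C*30 in the plan prompt
    updates = update_calls * tokens_per_update      # +M*500 across update calls
    return briefs + updates


overhead = estimate_overhead(existing_concepts=30, update_calls=2)
print(overhead)  # 30*30 + 2*500 = 1900, i.e. roughly 2000 tokens
```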
- -## Files Changed - -- `openkb/agent/compiler.py` — all changes - - New: `_read_concept_briefs()`, `_CONCEPTS_PLAN_USER`, `_CONCEPT_UPDATE_USER`, `_add_related_link()`, `_compile_concepts()` - - Modified: `compile_short_doc()`, `compile_long_doc()`, `_parse_json()` caller logic -- `tests/test_compiler.py` — update tests for new JSON format and update/related paths - -## Not In Scope - -- Concept briefs truncation/filtering for very large KBs (100+ concepts) — revisit when needed -- Interactive ingest (human-in-the-loop checkpoint) — separate feature -- Lint --fix auto-repair — separate feature diff --git a/docs/superpowers/specs/2026-04-09-retrieve-redesign.md b/docs/superpowers/specs/2026-04-09-retrieve-redesign.md deleted file mode 100644 index 15224be..0000000 --- a/docs/superpowers/specs/2026-04-09-retrieve-redesign.md +++ /dev/null @@ -1,262 +0,0 @@ -# Retrieve Redesign: Unified Query, Brief Summaries, and Local Page Content - -**Date:** 2026-04-09 -**Status:** Approved -**Branch:** bugfix/compile - -## Problems - -### 1. Long vs Short Doc Split in Query - -The query agent treats long documents (PageIndex-indexed) and short documents differently: - -- **Short docs**: agent reads `wiki/sources/{name}.md` via `read_file` -- **Long docs**: agent calls `pageindex_retrieve(doc_id, question)` — a black-box RAG call - -**Design Principle**: PageIndex is an indexer, not a retriever. Query-time retrieval should be done by the agent navigating the wiki, using the same tools for all documents. - -### 2. index.md Has No Brief Summaries - -Karpathy's gist says index.md should have "each page listed with a link, **a one-line summary**". Currently it only has wikilinks with no descriptions. The query agent must open every file to understand what's available. - -### 3. No Brief Summaries on Concepts Either - -Same problem: concept entries in index.md have no description. The agent can't judge relevance from the index alone. 
- -## Design - -### Part 1: Structured LLM Output with Brief Summaries - -All LLM generation steps (summary, concept create, concept update) now return a JSON object with both a one-line brief and the full content. - -#### Summary Generation - -`_SUMMARY_USER` prompt changes to request JSON output: - -``` -Write a summary page for this document in Markdown. - -Return a JSON object with two keys: -- "brief": A single sentence (under 100 chars) describing the document's main contribution -- "content": The full summary in Markdown. Include key concepts, findings, and [[wikilinks]] - -Return ONLY valid JSON, no fences. -``` - -LLM returns: -```json -{ - "brief": "Introduces the Transformer architecture based entirely on self-attention", - "content": "# Attention Is All You Need\n\nThis paper proposes..." -} -``` - -The `brief` is: -- Written into summary frontmatter: `brief: Introduces the Transformer...` -- Passed to `_update_index` for the Documents section - -The `content` is written to `wiki/summaries/{name}.md` as before. - -#### Concept Generation (create) - -`_CONCEPT_PAGE_USER` prompt changes similarly: - -``` -Write the concept page for: {title} - -Return a JSON object with two keys: -- "brief": A single sentence (under 100 chars) defining this concept -- "content": The full concept page in Markdown with [[wikilinks]] - -Return ONLY valid JSON, no fences. -``` - -The `brief` is: -- Written into concept frontmatter: `brief: Mechanism allowing each position to attend to all others` -- Passed to `_update_index` for the Concepts section -- Used by `_read_concept_briefs` (read from frontmatter instead of truncating body text) - -#### Concept Generation (update) - -`_CONCEPT_UPDATE_USER` also returns `{"brief": "...", "content": "..."}`. The brief may change as the concept evolves with new information. - -#### Long Doc Summary (overview) - -Long documents do NOT need the LLM to generate a brief. 
The brief comes directly from PageIndex's `doc_description` field (available via `IndexResult.description`), which is already a document-level summary generated during indexing. `_LONG_DOC_SUMMARY_USER` stays unchanged (returns plain markdown overview, not JSON) — the brief is passed through from the indexer. - -In `compile_long_doc`, the `doc_description` is passed to `_compile_concepts` which forwards it to `_update_index` as the doc brief. - -#### Parsing - -All LLM responses go through `_parse_json`. Callers extract `brief` and `content`: - -```python -parsed = _parse_json(raw) -brief = parsed.get("brief", "") -content = parsed.get("content", raw) # fallback: treat raw as content if not JSON -``` - -The fallback ensures backward compatibility if the LLM returns plain text instead of JSON. - -### Part 2: index.md with Brief Summaries - -`_update_index` signature changes: - -```python -def _update_index(wiki_dir, doc_name, concept_names, doc_brief="", concept_briefs=None): -``` - -Output format: - -```markdown -## Documents -- [[summaries/attention-is-all-you-need]] — Introduces the Transformer architecture based on self-attention -- [[summaries/flash-attention]] — Efficient attention algorithm reducing memory from quadratic to linear - -## Concepts -- [[concepts/self-attention]] — Mechanism allowing each position to attend to all others in a sequence -- [[concepts/transformer]] — Neural network architecture based entirely on attention mechanisms -``` - -When updating an existing entry (re-compile), the brief is updated in place. - -### Part 3: Frontmatter with Brief - -Summary and concept pages get a `brief` field in frontmatter: - -```markdown ---- -sources: [paper.pdf] -brief: Introduces the Transformer architecture based on self-attention ---- - -# Attention Is All You Need -... -``` - -`_read_concept_briefs` is updated to read from `brief:` frontmatter field instead of truncating body text. 
Fallback to body truncation if `brief:` is absent (backward compat with existing pages). - -### Part 4: Long Doc Sources from Markdown to JSON - -Store per-page content as JSON instead of a giant markdown file. - -**Current**: -``` -wiki/sources/paper.md ← rendered markdown, 10K-50K tokens -``` - -**New**: -``` -wiki/sources/paper.json ← per-page JSON array -``` - -**JSON format** (only the `pages` array from PageIndex, not the full doc object): -```json -[ - { - "page": 1, - "content": "Full text of page 1...", - "images": [{"path": "images/paper/p1_img1.png", "width": 400, "height": 300}] - }, - { - "page": 2, - "content": "Full text of page 2..." - } -] -``` - -`images` field is optional. Image paths are relative to `wiki/sources/`. Short documents are not affected — they stay as `.md`. - -#### Indexer Changes - -In `indexer.py`, replace `render_source_md` + `_relocate_images` with: -1. `col.get_page_content(doc_id, "1-9999")` to get all pages -2. Relocate image paths in each page's `images` array -3. Write as JSON to `wiki/sources/{name}.json` - -### Part 5: New Tool `get_page_content` - -Add to `openkb/agent/tools.py`: - -```python -def get_page_content(doc_name: str, pages: str, wiki_root: str) -> str: - """Get text content of specific pages from a long document. - - Args: - doc_name: Document name (e.g. 'attention-is-all-you-need'). - pages: Page specification (e.g. '3-5,7,10-12'). - wiki_root: Absolute path to the wiki root directory. - """ -``` - -Implementation: -1. Read `wiki/sources/{doc_name}.json` -2. Parse `pages` spec into a set of page numbers (comma-separated, ranges with `-`) -3. Filter pages, format as `[Page N]\n{content}\n\n` -4. Return concatenated text, or error if file not found - -### Part 6: Query Agent Changes - -**Remove**: `pageindex_retrieve` tool and `_pageindex_retrieve_impl` entirely. - -**Add**: `get_page_content` tool. - -**Update instructions**: -``` -## Search strategy -1. 
Read index.md to understand what documents and concepts are available. - Each entry has a brief summary to help you judge relevance. -2. Read relevant summary pages (summaries/) for document overviews. -3. Read concept pages (concepts/) for cross-document synthesis. -4. For long documents, use get_page_content(doc_name, pages) to read - specific pages. The summary page shows chapter structure with page - ranges to help you decide which pages to read. -5. Synthesise a clear, well-cited answer. -``` - -**Remove**: `openkb_dir` and `model` parameters from `build_query_agent`. - -### What Gets Removed - -- `_pageindex_retrieve_impl` (~110 lines) -- `pageindex_retrieve` tool -- `render_source_md` from `tree_renderer.py` -- `_relocate_images` in current form (replaced by per-page relocation) -- PageIndex imports in `query.py` - -### What Stays - -- `render_summary_md` — summaries still markdown -- Short doc pipeline — unchanged -- Image files in `wiki/sources/images/` -- PageIndex in `indexer.py` — still used for tree building - -## Compile Pipeline Changes Summary - -The compile pipeline (`_compile_concepts`, `compile_short_doc`, `compile_long_doc`) changes: - -1. **Summary step**: parse JSON response, extract `brief` + `content` -2. **Concept create/update steps**: parse JSON response, extract `brief` + `content` -3. **`_write_summary`**: add `brief` to frontmatter -4. **`_write_concept`**: add/update `brief` in frontmatter -5. **`_update_index`**: write `— {brief}` after each wikilink -6. 
**`_read_concept_briefs`**: read from `brief:` frontmatter field (fallback to body truncation) - -## Files Changed - -- `openkb/agent/compiler.py` — prompt templates return JSON with brief+content, parse responses, pass briefs to index/frontmatter -- `openkb/indexer.py` — sources output from md to json, image relocation per-page -- `openkb/agent/tools.py` — add `get_page_content` -- `openkb/agent/query.py` — remove `pageindex_retrieve`, add `get_page_content`, update instructions -- `openkb/tree_renderer.py` — remove `render_source_md` -- `openkb/schema.py` — update AGENTS_MD -- `tests/test_compiler.py` — update for JSON LLM responses -- `tests/test_indexer.py` — update for JSON output -- `tests/test_query.py` — update for new tool set -- `tests/test_agent_tools.py` — add tests for `get_page_content` - -## Not In Scope - -- Cloud PageIndex query support (removed entirely) -- Changes to the lint pipeline -- Interactive ingest
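
For reference, the page-specification handling described for `get_page_content` in Part 5 could look roughly like the sketch below. It assumes well-formed specs such as `3-5,7,10-12` and works on an in-memory page list rather than reading `wiki/sources/{doc_name}.json`. `parse_page_spec` and `format_pages` are hypothetical helper names, not part of `openkb/agent/tools.py`.

```python
def parse_page_spec(spec):
    """Parse a spec like '3-5,7,10-12' into a set of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. '3-5' covers pages 3, 4 and 5.
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return pages


def format_pages(all_pages, spec):
    """Keep only the requested pages and emit a '[Page N]' header before each."""
    wanted = parse_page_spec(spec)
    return "".join(
        f"[Page {p['page']}]\n{p['content']}\n\n"
        for p in all_pages
        if p["page"] in wanted
    )


# Sample data in the per-page JSON shape from Part 4 (images omitted).
sample = [{"page": n, "content": f"text {n}"} for n in range(1, 6)]
print(format_pages(sample, "2-3,5"))
```

Parsing into a set means overlapping ranges in a spec do not produce duplicate pages, and output order follows the order of pages in the source array.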