fix: remove docs by rejojer · Pull Request #9 · VectifyAI/OpenKB

rejojer · 2026-04-09T22:24:04Z

Summary

Unified summary frontmatter to doc_type + full_text, removed redundant fields (sources, brief, source_doc, doc_id)
Concept pages now link to summaries instead of raw PDF filenames
Fixed duplicate frontmatter bug in both summary and concept pages (prompt fix + strip fallback)
Improved query agent: use full_text field, restrict get_page_content to pageindex docs, add self-talk, concise answers
Fixed image path mismatch in pageindex JSON content
Removed page marker comments from short doc source markdown
Fixed warning suppression (markitdown overrides filters at import time)
Improved init prompts with explicit defaults, American English spelling
Various output formatting fixes (tool call spacing, step name colons, unicode ellipsis)

Test plan

openkb init shows correct prompts with defaults
openkb add short doc: clean single frontmatter, no page markers in source
openkb add long doc: correct image paths in JSON content
openkb query on short doc: reads source via read_file, no get_page_content
openkb query on long doc: uses get_page_content with targeted page ranges
No PyPDF2 deprecation warning during any operation

- Restore markitdown[all] extras for docx/pptx/xlsx support
- Sanitize concept names to prevent path traversal in compiler
- Add path traversal guard in copy_relative_images
- Fix _write_concept duplicate append when frontmatter lacks sources key
- Remove dead write_wiki_files function
- Fix watcher thread race in _schedule_flush
- Warn when unimplemented --fix flag is used in lint command
- Harden CI publish workflow with environment gate and SHA-pinned actions
- Fix test_indexer to actually assert IndexConfig flag values
- Fix test_converter to test correct PDF code path (pymupdf, not markitdown)
- Use str.find() instead of str.index() in frontmatter parsing to avoid ValueError
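A path traversal guard of the kind mentioned for copy_relative_images could look roughly like the sketch below. This is not the code from the PR: `resolve_inside` and its parameters are hypothetical names chosen for illustration, and the approach (resolve both paths, then check containment) is just one common way to do it.

```python
from pathlib import Path


def resolve_inside(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting anything that escapes `root`.

    Hypothetical helper for illustration; not taken from the OpenKB codebase.
    """
    root = root.resolve()
    candidate = (root / relative).resolve()
    # is_relative_to() (Python 3.9+) is False for paths such as "../secret"
    # or absolute paths that land outside the resolved root.
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes {root}: {relative!r}")
    return candidate
```

Resolving before comparing matters: a plain string prefix check would accept `root/../other` even though it points outside the root.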

- Add _backlink_summary: ensures summary pages link to all related concepts
- Add _backlink_concepts: ensures concept pages link back to source summaries
- _update_index auto-creates index.md if missing
- Both merge into existing sections instead of duplicating

Adds parse_pages() to expand page specs like "1-3,7" into sorted deduplicated int lists, and get_page_content() to read per-page JSON (sources/{doc}.json) and format output with optional image paths. Includes path-traversal guard consistent with existing tools.
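The page-spec expansion described above can be sketched as follows. This is a minimal illustration of the stated behavior ("1-3,7" becomes a sorted, deduplicated list of ints), not the implementation from the PR; the handling of stray commas and reversed ranges is an assumption.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-3,7" into a sorted, deduplicated list."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # assumption: stray commas are tolerated
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                start, end = end, start  # assumption: reversed ranges are normalized
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

Collecting into a set first makes overlapping specs such as "3,1-2,2" collapse naturally before sorting.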

Replace markdown source generation with per-page JSON from PageIndex get_page_content; remove render_source_md, _render_nodes_source, _relocate_images, and _IMG_REF_RE. Image relocation is now done inline per page. Update tests to assert .json output and mock get_page_content.

…or all docs

Remove _pageindex_retrieve_impl and the pageindex_retrieve tool; add get_page_content_tool that uses the local JSON-based page store for all long documents. Update instructions and schema description accordingly.

… indexer

- Default model changed from gpt-5.4 to gpt-5.4-mini
- Indexer get_page_content no longer uses hardcoded 9999 fallback
- Infers page_count from structure end_index when doc lacks page_count field
- Added debug logging for doc keys and page_count diagnosis
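Inferring page_count from the structure's end_index values might look like the sketch below. The document shape is an assumption: a top-level "structure" list whose nodes carry "end_index" and optional child "nodes". The function name `infer_page_count` is hypothetical and does not come from the PR.

```python
from typing import Any, Optional


def infer_page_count(doc: dict[str, Any]) -> Optional[int]:
    """Return doc["page_count"] if present, else the largest end_index in the tree.

    Assumed shape: doc["structure"] is a list of nodes with "end_index" and
    optional child "nodes". These key names are assumptions for illustration.
    """
    page_count = doc.get("page_count")
    if isinstance(page_count, int) and page_count > 0:
        return page_count

    def max_end_index(nodes: list[dict[str, Any]]) -> int:
        best = 0
        for node in nodes:
            best = max(best, int(node.get("end_index", 0)))
            best = max(best, max_end_index(node.get("nodes", [])))
        return best

    inferred = max_end_index(doc.get("structure", []))
    return inferred or None  # None signals "unknown" instead of a magic number
```

Returning None when nothing can be inferred keeps callers from silently relying on a hardcoded sentinel value.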

…e backlink for short docs

- index.md entries now show (short) or (pageindex) type marker
- Query agent prompt updated: guides agent to read sources for detail
- Removed list_files tool from query agent (index.md is sufficient)
- Short doc summaries now have source_doc frontmatter linking to sources/
- Reverted list_wiki_files to only list .md files
- Fixed tests for model name change and agent tool count

Replace PageIndex get_page_content with pymupdf-based convert_pdf_to_pages for long doc JSON generation. All image paths now use sources/images/ prefix relative to wiki root. Removes dependency on PageIndex for source content.

KylinMountain and others added 30 commits April 9, 2026 00:21

debug code about compile

cc12d95

feat: add _read_concept_briefs for concept dedup context

8640681

feat: add concepts plan and update prompt templates

4f1d332

Add _CONCEPTS_PLAN_USER (create/update/related JSON structure) and _CONCEPT_UPDATE_USER templates; add TestParseConceptsPlan tests.

feat: concept dedup with briefs, update/related paths, extract _compile_concepts

fc0857e

chore: update compiler docstring, remove dead _CONCEPTS_LIST_USER

4249d53

docs: specs and plans for concept dedup and retrieve redesign

072d9f5

feat: update LLM prompts to return brief+content JSON

b6ce04e

Replace _SUMMARY_USER, _CONCEPT_PAGE_USER, and _CONCEPT_UPDATE_USER to request a JSON object with "brief" (one-line summary) and "content" (full Markdown). Add TestParseBriefContent to tests/test_compiler.py.

feat: store brief in frontmatter of summary and concept pages

a172c43

feat: add briefs to index.md entries and read from frontmatter

ca23912

feat: wire brief+content JSON through compile pipeline to index and frontmatter

5b086a5

fix: remove tests for deleted render_source_md

8b75b7e

chore: remove dead references to render_source_md

36ae619

feat: warn when no LLM API key found instead of failing silently

739c8eb

fix: strengthen query agent instructions to always read source content

be66e31

Revert "fix: strengthen query agent instructions to always read source content"

7b3bc0c

This reverts commit be66e31.

fix: isolate tests from real KB directories via mocking

634b212

fix: suppress warnings and disable agents SDK tracing via API

19ebfed

fix: add MAX_TURNS limit to agent Runner calls

dde64d1

refactor: unify summary frontmatter to doc_type + full_text

63da1fe

Replace sources/brief/source_doc/doc_id/source fields with two consistent fields: doc_type (short|pageindex) and full_text pointing to the actual source content under sources/.
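To make the unified frontmatter concrete, here is a small sketch that renders the two fields. The helper name `build_summary_frontmatter` and the example path are hypothetical; only the field names (doc_type, full_text), the allowed values (short, pageindex), and the sources/ location come from the commit message.

```python
def build_summary_frontmatter(doc_type: str, full_text: str) -> str:
    """Render a two-field YAML frontmatter block for a summary page.

    Hypothetical helper; the PR may build the block differently.
    """
    if doc_type not in {"short", "pageindex"}:
        raise ValueError(f"unknown doc_type: {doc_type!r}")
    return f"---\ndoc_type: {doc_type}\nfull_text: {full_text}\n---\n"


# Example with a made-up document name:
print(build_summary_frontmatter("short", "sources/example.md"))
```

With a single pointer field, the query agent can follow full_text to the source content without knowing how paths are laid out for each document type.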

fix: concept sources link to summaries and strip duplicate frontmatter

06e26ce

Concept pages now reference summaries/{doc}.md instead of raw PDF filenames. Also strips frontmatter from LLM content during concept updates to prevent duplicate YAML blocks. Removes unused _find_source_filename.

fix: update query agent to use summary full_text field

f38781e

Add hint that summaries may omit details. Update search strategy to reference the full_text frontmatter field instead of hardcoded paths.

fix: remove page marker comments from short doc source markdown

bebfbdb

fix: rename chapter structure to document tree structure in query prompt

4d34baf

rejojer and others added 14 commits April 10, 2026 04:35

fix: improve query agent prompt wording for source content

5f563ee

fix: move warning suppression after imports to avoid markitdown override

0b07a8e
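Why the ordering matters can be shown with the standard warnings module alone: filters installed later are placed at the front of the filter list and therefore win. In the sketch below, `simulate_library_import` is a stand-in for a third-party import that installs its own filter; how markitdown actually does this is not shown in the PR, so treat that part as an assumption.

```python
import warnings


def simulate_library_import() -> None:
    """Stand-in for a third-party import that installs its own warning filter."""
    warnings.simplefilter("always", DeprecationWarning)


def count_deprecations(suppress_after_import: bool) -> int:
    """Count DeprecationWarnings that get through, depending on filter order."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.resetwarnings()
        if not suppress_after_import:
            # Installed first, so the library's later filter takes precedence.
            warnings.filterwarnings("ignore", category=DeprecationWarning)
        simulate_library_import()
        if suppress_after_import:
            # Installed last, so it sits at the front of the filter list.
            warnings.filterwarnings("ignore", category=DeprecationWarning)
        warnings.warn("deprecated call", DeprecationWarning)
    return len(caught)
```

Under these assumptions, suppressing before the import lets the warning through, while suppressing after the import silences it.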

fix: add blank line between tool calls and before answer in query output

45c5b6c

fix: add self-talk before tool calls and fix output formatting

0118d2d

fix: add space after colon in concept/update step names

15f970d

fix: prevent duplicate frontmatter in LLM-generated content

c8f96eb

Remove frontmatter format from schema to avoid LLM copying it. Add strip as fallback in _write_summary and _write_concept create path.
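A strip fallback of this kind could be sketched as below. It is an illustration, not the code in _write_summary or _write_concept: the function name `strip_frontmatter` is hypothetical, and it uses str.find() so that a missing closing delimiter yields -1 instead of raising ValueError.

```python
def strip_frontmatter(text: str) -> str:
    """Drop a leading YAML frontmatter block if present; otherwise return text unchanged.

    Hypothetical helper for illustration only.
    """
    if not text.startswith("---"):
        return text
    # find() returns -1 when there is no closing delimiter, where index() would raise.
    end = text.find("\n---", 3)
    if end == -1:
        return text
    return text[end + len("\n---"):].lstrip("\n")
```

Applied to LLM output before the page's own frontmatter is written, this leaves at most one YAML block even when the model echoes one back.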

fix: improve init prompts, prevent duplicate frontmatter, use American English

febc8c9

fix: improve query agent tool descriptions and prompt clarity

4938cd7

fix: replace unicode ellipsis, fix image paths in pageindex content, remove empty dirs on init

5a1f014

feat: add multimodal get_image tool to query agent

0340cb1

Query agent can now view images referenced in source documents via get_image tool, which returns ToolOutputImage for the LLM to inspect. Prompt updated to use images when questions involve figures or visuals.

fix: update tests for image path changes and removed init dirs

151b90e

fix: mock _find_kb_dir in test_add_missing_init to isolate from real KB dirs

f383fbe

merge: resolve conflicts with origin/dev

a496d04

Accept all origin/dev changes including: image support in query agent, robust JSON parsing with json_repair, unicode concept name support, section-based index operations, cloud/local page extraction fallback.

KylinMountain changed the base branch from dev to main April 11, 2026 02:53

KylinMountain added 3 commits April 11, 2026 10:54

merge: incorporate origin/main into bugfix/compile

03ae61b

merge: reconcile with remote bugfix/compile

b245187

chore: remove docs/ directory from branch

a1460b4

KylinMountain changed the title ~~fix: compile pipeline, query agent, and frontmatter improvements~~ fix: remove docs Apr 11, 2026

KylinMountain merged commit 726336a into main Apr 11, 2026

KylinMountain merged 47 commits into main from bugfix/compile
