Merged
Add _CONCEPTS_PLAN_USER (create/update/related JSON structure) and _CONCEPT_UPDATE_USER templates; add TestParseConceptsPlan tests.
- Restore markitdown[all] extras for docx/pptx/xlsx support
- Sanitize concept names to prevent path traversal in compiler
- Add path traversal guard in copy_relative_images
- Fix _write_concept duplicate append when frontmatter lacks sources key
- Remove dead write_wiki_files function
- Fix watcher thread race in _schedule_flush
- Warn when unimplemented --fix flag is used in lint command
- Harden CI publish workflow with environment gate and SHA-pinned actions
- Fix test_indexer to actually assert IndexConfig flag values
- Fix test_converter to test correct PDF code path (pymupdf, not markitdown)
- Use str.find() instead of str.index() in frontmatter parsing to avoid ValueError
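The two path-traversal items above could look roughly like the following. This is a minimal sketch, not the code from this PR: `sanitize_concept_name` and `safe_join` are hypothetical names, and the exact character policy is an assumption.

```python
import re
from pathlib import Path


def sanitize_concept_name(name: str) -> str:
    """Reduce a concept name to a single safe filename component (hypothetical helper)."""
    # Collapse path separators, dots, and other unsafe characters into a dash.
    # \w is Unicode-aware for str patterns, so non-ASCII names are preserved.
    cleaned = re.sub(r"[^\w\- ]+", "-", name).strip("-. ")
    return cleaned or "untitled"


def safe_join(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting anything that escapes it."""
    root = root.resolve()
    candidate = (root / relative).resolve()
    if not candidate.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"path escapes root: {relative!r}")
    return candidate
```

Resolving both paths before comparing them means `..` segments and absolute paths are caught by the same check.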
- Add _backlink_summary: ensures summary pages link to all related concepts
- Add _backlink_concepts: ensures concept pages link back to source summaries
- _update_index auto-creates index.md if missing
- Both merge into existing sections instead of duplicating
Adds parse_pages() to expand page specs like "1-3,7" into sorted
deduplicated int lists, and get_page_content() to read per-page JSON
(sources/{doc}.json) and format output with optional image paths.
Includes path-traversal guard consistent with existing tools.
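The page-spec expansion described above can be sketched as follows. This is an illustrative version only. The handling of reversed ranges, stray commas, and 1-based page numbers is an assumption, not confirmed by the commit.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-3,7" into a sorted, deduplicated list of ints."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray or trailing commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                start, end = end, start  # assumption: reversed ranges are normalized
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # Assumption: pages are 1-based, so zero is dropped.
    return sorted(p for p in pages if p >= 1)
```

For example, `parse_pages("1-3,7")` returns `[1, 2, 3, 7]`, and overlapping specs such as `"3,1-2,2"` collapse to `[1, 2, 3]`.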
Replace _SUMMARY_USER, _CONCEPT_PAGE_USER, and _CONCEPT_UPDATE_USER to request a JSON object with "brief" (one-line summary) and "content" (full Markdown). Add TestParseBriefContent to tests/test_compiler.py.
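A parser for the "brief"/"content" reply might look like the sketch below. The function name `parse_brief_content` and the fallback behaviour for non-JSON replies are assumptions made for illustration. The commit only states that the templates request such a JSON object.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_brief_content(raw: str) -> tuple[str, str]:
    """Extract (brief, content) from an LLM reply expected to be a JSON object."""
    text = raw.strip()
    # Models often wrap JSON in a Markdown code fence; drop it if present.
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumed fallback: treat the whole reply as Markdown content.
        return "", raw.strip()
    if not isinstance(data, dict):
        return "", raw.strip()
    brief = str(data.get("brief", "")).strip()
    content = str(data.get("content", "")).strip()
    return brief, content
```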
Replace markdown source generation with per-page JSON from PageIndex get_page_content; remove render_source_md, _render_nodes_source, _relocate_images, and _IMG_REF_RE. Image relocation is now done inline per page. Update tests to assert .json output and mock get_page_content.
…or all docs
Remove _pageindex_retrieve_impl and the pageindex_retrieve tool; add get_page_content_tool that uses the local JSON-based page store for all long documents. Update instructions and schema description accordingly.
… indexer
- Default model changed from gpt-5.4 to gpt-5.4-mini
- Indexer get_page_content no longer uses hardcoded 9999 fallback
- Infers page_count from structure end_index when doc lacks page_count field
- Added debug logging for doc keys and page_count diagnosis
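The page_count inference could work along these lines. This sketch assumes the document dict carries a `structure` list of nodes, each with an `end_index` and optional nested `nodes`. Those key names other than `page_count` and `end_index` are assumptions about the tree layout, and `infer_page_count` is a hypothetical name.

```python
from __future__ import annotations

from typing import Any


def infer_page_count(doc: dict[str, Any]) -> int | None:
    """Return doc["page_count"] if present, else the largest end_index in the tree."""
    count = doc.get("page_count")
    if isinstance(count, int) and count > 0:
        return count
    ends: list[int] = []
    stack = list(doc.get("structure") or [])
    while stack:
        node = stack.pop()
        end = node.get("end_index")
        if isinstance(end, int):
            ends.append(end)
        stack.extend(node.get("nodes") or [])  # assumed key for child nodes
    # None signals "unknown" rather than falling back to a magic number.
    return max(ends) if ends else None
```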
…e backlink for short docs
- index.md entries now show (short) or (pageindex) type marker
- Query agent prompt updated: guides agent to read sources for detail
- Removed list_files tool from query agent (index.md is sufficient)
- Short doc summaries now have source_doc frontmatter linking to sources/
- Reverted list_wiki_files to only list .md files
- Fixed tests for model name change and agent tool count
…e content" This reverts commit be66e31.
Replace sources/brief/source_doc/doc_id/source fields with two consistent fields: doc_type (short|pageindex) and full_text pointing to the actual source content under sources/.
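As an illustration of the two-field frontmatter, a builder might look like this. It is a sketch under stated assumptions: `build_frontmatter` is a hypothetical helper, and the example path `sources/report.json` is invented for the demonstration.

```python
ALLOWED_DOC_TYPES = {"short", "pageindex"}


def build_frontmatter(doc_type: str, full_text: str) -> str:
    """Render the two-field YAML frontmatter block for a summary page."""
    if doc_type not in ALLOWED_DOC_TYPES:
        raise ValueError(f"unknown doc_type: {doc_type!r}")
    # full_text is a path under sources/ pointing at the actual source content.
    return f"---\ndoc_type: {doc_type}\nfull_text: {full_text}\n---\n"
```

Calling `build_frontmatter("pageindex", "sources/report.json")` yields a block with exactly those two keys between `---` delimiters.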
Concept pages now reference summaries/{doc}.md instead of raw PDF
filenames. Also strips frontmatter from LLM content during concept
updates to prevent duplicate YAML blocks. Removes unused
_find_source_filename.
Add hint that summaries may omit details. Update search strategy to reference the full_text frontmatter field instead of hardcoded paths.
Remove frontmatter format from schema to avoid LLM copying it. Add strip as fallback in _write_summary and _write_concept create path.
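Stripping a frontmatter block that the LLM copied into its output could be done roughly as below. `strip_frontmatter` is a hypothetical name, and the sketch only handles a block at the very start of the text. It uses `str.find()`, which returns -1 on a missing closing delimiter where `str.index()` would raise ValueError.

```python
def strip_frontmatter(text: str) -> str:
    """Remove a leading YAML frontmatter block ('---' ... '---'), if present."""
    stripped = text.lstrip()
    if not stripped.startswith("---"):
        return text
    # Look for the closing delimiter at the start of a later line.
    end = stripped.find("\n---", 3)
    if end == -1:
        return text  # unterminated block: leave the text untouched
    # Skip the closing delimiter line and any blank lines after it.
    rest = stripped[end + 4:]
    newline = rest.find("\n")
    return "" if newline == -1 else rest[newline + 1:].lstrip("\n")
```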
…remove empty dirs on init
Replace PageIndex get_page_content with pymupdf-based convert_pdf_to_pages for long doc JSON generation. All image paths now use sources/images/ prefix relative to wiki root. Removes dependency on PageIndex for source content.
Query agent can now view images referenced in source documents via get_image tool, which returns ToolOutputImage for the LLM to inspect. Prompt updated to use images when questions involve figures or visuals.
Accept all origin/dev changes including: image support in query agent, robust JSON parsing with json_repair, unicode concept name support, section-based index operations, cloud/local page extraction fallback.
Summary
- doc_type + full_text, removed redundant fields (sources, brief, source_doc, doc_id)
- full_text field, restrict get_page_content to pageindex docs, add self-talk, concise answers
Test plan
- openkb init shows correct prompts with defaults
- openkb add short doc: clean single frontmatter, no page markers in source
- openkb add long doc: correct image paths in JSON content
- openkb query on short doc: reads source via read_file, no get_page_content
- openkb query on long doc: uses get_page_content with targeted page ranges