feat(document-api): implement doc.extract() for RAG content extraction (SD-2525) #2774

Merged
caio-pizzol merged 5 commits into main from caio/sd-2525-document-content-extraction-api-for-rag-pipelines on Apr 10, 2026

@caio-pizzol
Contributor

Single API method that extracts all document content with stable IDs for RAG pipelines.

  • editor.doc.extract() returns blocks with full text, comments with anchored block references, and tracked changes with excerpts
  • Every ID works directly with scrollToElement() for citation navigation
  • Follows the Document API contract pattern (4 touch points: operation-definitions, registry, schemas, dispatch)
  • No arbitrary limits — returns all blocks in document order
  • Full text per block (not the 80-char textPreview from blocks.list)

Usage:

const { blocks, comments, trackedChanges } = editor.doc.extract();

// Store IDs alongside embeddings
const chunks = blocks.map(b => ({ id: b.nodeId, text: b.text }));

// Navigate back on citation click (chunk is the cited entry from chunks)
await superdoc.scrollToElement(chunk.id);
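
As a rough illustration of the flow above, the sketch below indexes extracted blocks by ID and resolves a citation back to its source chunk. Only `nodeId` and `text` are taken from the usage snippet; the `ExtractedBlock` and `Chunk` types, the `buildChunkIndex` and `resolveCitation` helpers, the sample data, and the choice to skip whitespace-only blocks are all hypothetical, and the real result very likely carries more fields.

```typescript
// Assumed minimal block shape: only nodeId and text are confirmed by the PR description.
interface ExtractedBlock {
  nodeId: string;
  text: string;
}

// What a RAG pipeline might store next to each embedding (hypothetical).
interface Chunk {
  id: string;
  text: string;
}

// Hypothetical helper: key chunks by their stable block ID.
// Skipping whitespace-only blocks is a choice made for this sketch, not API behavior.
function buildChunkIndex(blocks: ExtractedBlock[]): Map<string, Chunk> {
  const index = new Map<string, Chunk>();
  for (const block of blocks) {
    const text = block.text.trim();
    if (text.length === 0) continue;
    index.set(block.nodeId, { id: block.nodeId, text });
  }
  return index;
}

// Hypothetical helper: map a cited ID back to the stored chunk, if any.
function resolveCitation(index: Map<string, Chunk>, citedId: string): Chunk | undefined {
  return index.get(citedId);
}

// Sample data standing in for editor.doc.extract().blocks.
const sampleBlocks: ExtractedBlock[] = [
  { nodeId: "block-1", text: "Termination requires 30 days notice." },
  { nodeId: "block-2", text: "   " },
  { nodeId: "block-3", text: "Payment is due within 45 days." },
];

const chunkIndex = buildChunkIndex(sampleBlocks);
const cited = resolveCitation(chunkIndex, "block-3");
console.log(chunkIndex.size, cited?.text);
```

In an application, the resolved `id` would then be handed to `superdoc.scrollToElement(id)`, as in the usage snippet above.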

Closes SD-2525

@linear
linear bot commented Apr 10, 2026

@mintlify
mintlify bot commented Apr 10, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
SuperDoc 🟢 Ready View Preview Apr 10, 2026, 5:26 PM

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e4827d3ab

caio-pizzol force-pushed the caio/sd-2525-document-content-extraction-api-for-rag-pipelines branch from 7e4827d to 5cb7735 on April 10, 2026, 17:44
…n (SD-2525)

Single API method that returns all document content with stable IDs —
blocks with full text, comments with anchored block references, and
tracked changes with excerpts. Every ID works directly with
scrollToElement() for citation navigation.
- Use canonical getHeadingLevel() instead of divergent local regex
- Reuse collectTopLevelBlocks() instead of duplicating block traversal
- Add required fields to extract output JSON schema
- Remove fixture-only unit tests that don't call executeExtract
- Add behavior tests: headings, comments, tracked changes, scrollToElement round-trip
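
To illustrate what marking fields as required in the output schema gives consumers, here is a minimal sketch of a schema fragment and a checker. The property names are the ones used in the PR's usage snippet; which fields the real schema marks as required is an assumption, and `extractOutputSchema` and `missingRequired` are hypothetical names rather than the project's code.

```typescript
// Hypothetical JSON-Schema-style fragment; the real extract output schema may differ.
const extractOutputSchema = {
  type: "object",
  required: ["blocks", "comments", "trackedChanges"],
  properties: {
    blocks: {
      type: "array",
      items: { type: "object", required: ["nodeId", "text"] },
    },
  },
} as const;

// Returns the paths of required keys that are absent from a candidate result.
function missingRequired(value: Record<string, unknown>): string[] {
  const missing: string[] = [];
  // Top-level required keys.
  for (const key of extractOutputSchema.required) {
    if (!(key in value)) missing.push(key);
  }
  // Required keys on each block, when blocks is an array.
  const blocks = value["blocks"];
  if (Array.isArray(blocks)) {
    blocks.forEach((block, i) => {
      for (const key of extractOutputSchema.properties.blocks.items.required) {
        if (typeof block !== "object" || block === null || !(key in block)) {
          missing.push(`blocks[${i}].${key}`);
        }
      }
    });
  }
  return missing;
}

const ok = missingRequired({
  blocks: [{ nodeId: "b1", text: "Hello" }],
  comments: [],
  trackedChanges: [],
});
const bad = missingRequired({ blocks: [{ nodeId: "b1" }], comments: [] });
console.log(ok, bad);
```

With required fields declared, a consumer can reject a malformed result up front instead of failing later when a block lacks its ID or text.
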
caio-pizzol force-pushed the caio/sd-2525-document-content-extraction-api-for-rag-pipelines branch from beb1257 to 565e4a3 on April 10, 2026, 18:46
caio-pizzol force-pushed the caio/sd-2525-document-content-extraction-api-for-rag-pipelines branch from 332999b to 10b1403 on April 10, 2026, 20:39
caio-pizzol added this pull request to the merge queue on Apr 10, 2026
Merged via the queue into main with commit c2f2577 on Apr 10, 2026
53 of 56 checks passed
caio-pizzol deleted the caio/sd-2525-document-content-extraction-api-for-rag-pipelines branch on April 10, 2026, 21:09
@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in @superdoc-dev/react v1.0.0-next.38

The release is available on GitHub release

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in esign v2.2.0-next.42

The release is available on GitHub release

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in vscode-ext v1.1.0-next.84

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in template-builder v1.3.0-next.44

The release is available on GitHub release

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in superdoc v1.24.0-next.81

The release is available on GitHub release

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in superdoc-cli v0.5.0-next.82

The release is available on GitHub release

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in superdoc-sdk v1.3.0-next.83

caio-pizzol added a commit that referenced this pull request Apr 10, 2026
…n (SD-2525) (#2774)

* feat(document-api): implement doc.extract() for RAG content extraction (SD-2525)

Single API method that returns all document content with stable IDs —
blocks with full text, comments with anchored block references, and
tracked changes with excerpts. Every ID works directly with
scrollToElement() for citation navigation.

* fix(document-api): review fixes — heading regex, schema required, tests

- Use canonical getHeadingLevel() instead of divergent local regex
- Reuse collectTopLevelBlocks() instead of duplicating block traversal
- Add required fields to extract output JSON schema
- Remove fixture-only unit tests that don't call executeExtract
- Add behavior tests: headings, comments, tracked changes, scrollToElement round-trip

* fix(tests): remove superdoc.click() — fixture uses type() for focus

* fix(cli): add extract operation hints for CLI/SDK wiring
@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in vscode-ext v2.3.0-next.1

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in template-builder v1.5.0-next.1

The release is available on GitHub release

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in esign v2.3.0-next.1

The release is available on GitHub release

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in superdoc v1.26.0-next.1

The release is available on GitHub release

@superdoc-bot
Contributor

superdoc-bot bot commented Apr 10, 2026

🎉 This PR is included in superdoc-cli v0.7.0-next.1

The release is available on GitHub release
