fix: populate _split_overlap metadata for word/token units in RecursiveDocumentSplitter #11825

Open
Osamaali313 wants to merge 1 commit into deepset-ai:main from Osamaali313:fix/recursive-splitter-word-token-overlap-metadata

Conversation

@Osamaali313

Related Issues

None — found while reviewing RecursiveDocumentSplitter overlap handling (a follow-up to the split_idx_start family, #11710/#11711, which fixed a different site).

Proposed Changes

_add_overlap_info computed the overlap length with self._chunk_length(prev_doc.content), which returns a word/token count for split_unit="word"/"token", whereas curr_pos and split_idx_start are character offsets. Mixing units made overlap_length negative, so the if overlap_length > 0 guard never fired and the _split_overlap metadata was silently left empty for word/token splitting with overlap. Character units were unaffected, since there len(content) == _chunk_length(content).

Fix: measure the overlap and ranges in characters via len(prev_doc.content).

# before (word unit, split_overlap=1): every chunk's _split_overlap == []
# after: overlap provenance is populated for all overlapping chunks
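
The unit mismatch described above can be reproduced in isolation. The following is a minimal, self-contained sketch and not the Haystack implementation: chunk_length and overlap_length are hypothetical stand-ins for the splitter's private helpers, the in_chars flag exists only to compare the two calculations side by side, and counting words by whitespace is an assumption made for illustration.

```python
def chunk_length(text: str, unit: str) -> int:
    """Length of text in the given unit (hypothetical stand-in for _chunk_length)."""
    if unit == "word":
        return len(text.split())  # assumption: words are whitespace-separated
    return len(text)


def overlap_length(prev_content: str, prev_start: int, curr_pos: int, unit: str, in_chars: bool) -> int:
    """Overlap between the previous chunk and a chunk starting at curr_pos.

    prev_start and curr_pos are character offsets into the source text.
    in_chars=False mimics the unit-mixing calculation; in_chars=True measures in characters.
    """
    prev_len = len(prev_content) if in_chars else chunk_length(prev_content, unit)
    return prev_len - (curr_pos - prev_start)


text = "one two three four five six"
prev_content = text[0:14]  # "one two three "
curr_pos = text.index("three")  # character offset 8: the next chunk re-uses one word

mixed = overlap_length(prev_content, 0, curr_pos, unit="word", in_chars=False)
fixed = overlap_length(prev_content, 0, curr_pos, unit="word", in_chars=True)

print(mixed)  # 3 words - 8 chars = -5, so an "overlap_length > 0" guard never fires
print(fixed)  # 14 chars - 8 chars = 6
print(repr(text[curr_pos:curr_pos + fixed]))  # 'three '
```

With unit="char" both calculations agree, which matches the observation that character-based splitting was unaffected.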

How did you test it?

Added test_run_split_by_dot_and_overlap_1_word_unit_split_overlap_metadata, which fails on main (empty metadata) and passes after the fix. Full test_recursive_splitter.py: 49 passed. Release note added under releasenotes/notes/.

Disclaimer: this PR was prepared with AI assistance (Claude). I reviewed the change and verified it RED→GREEN against the real component before submitting.

Osamaali313 requested a review from a team as a code owner on June 29, 2026 19:09
Osamaali313 requested review from Copilot and sjrl, and removed the request for a team and Copilot, on June 29, 2026 19:09
@vercel

vercel Bot commented Jun 29, 2026

@Osamaali313 is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLAassistant commented Jun 29, 2026

CLA assistant check
All committers have signed the CLA.

sjrl self-assigned this on Jun 30, 2026
@@ -0,0 +1,9 @@
---
fixes:

We use reStructuredText for our reno files, so we should make sure to use double backticks for inline code.

Comment on lines +757 to +759
Regression: the overlap length was computed with a word/token count while
curr_pos/split_idx_start are character offsets, so it went negative and the
_split_overlap metadata was silently left empty for word/token units.

Let's remove this. We don't refer to old implementations, only to the current state of the code.

Suggested change
- Regression: the overlap length was computed with a word/token count while
- curr_pos/split_idx_start are character offsets, so it went negative and the
- _split_overlap metadata was silently left empty for word/token units.

@sjrl

sjrl commented Jun 30, 2026

Hey @Osamaali313, thanks for opening the PR! Please address the failing CI checks.

)


def test_run_split_by_dot_and_overlap_1_word_unit_split_overlap_metadata():

Let's name the test so that it describes the behavior we expect.

Suggested change
- def test_run_split_by_dot_and_overlap_1_word_unit_split_overlap_metadata():
+ def test_word_unit_split_populates_split_overlap_metadata():

Comment on lines +406 to +410
# curr_pos and split_idx_start are character offsets, so the overlap and
# range must be measured in characters too. Using self._chunk_length()
# here would mix units for word/token splitting (it returns a word/token
# count), making overlap_length negative and silently dropping the
# _split_overlap metadata.

This is a bit wordy. Let's shorten it to:

Suggested change
- # curr_pos and split_idx_start are character offsets, so the overlap and
- # range must be measured in characters too. Using self._chunk_length()
- # here would mix units for word/token splitting (it returns a word/token
- # count), making overlap_length negative and silently dropping the
- # _split_overlap metadata.
+ # curr_pos and split_idx_start are character offsets, so measure the
+ # overlap and range in characters too (not via _chunk_length, which returns a word/token count).
