fix: populate _split_overlap metadata for word/token units in RecursiveDocumentSplitter#11825
Open
Osamaali313 wants to merge 1 commit into
Open
Conversation
…veDocumentSplitter _add_overlap_info computed the overlap length with self._chunk_length(prev_doc.content), which returns a word/token count for word/token split units, while curr_pos and split_idx_start are character offsets. Mixing units made overlap_length negative, so the guard never fired and the _split_overlap metadata was silently left empty for word/token splitting with overlap (char units were unaffected). Measure the overlap and ranges in characters via len(prev_doc.content). Adds a regression test and a release note.
|
@Osamaali313 is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
sjrl
reviewed
Jun 30, 2026
| @@ -0,0 +1,9 @@ | |||
| --- | |||
| fixes: | |||
Contributor
There was a problem hiding this comment.
we use restructed text markdown for our reno files so we should make sure to use double back ticks for inline code comments.
sjrl
reviewed
Jun 30, 2026
Comment on lines
+757
to
+759
| Regression: the overlap length was computed with a word/token count while | ||
| curr_pos/split_idx_start are character offsets, so it went negative and the | ||
| _split_overlap metadata was silently left empty for word/token units. |
Contributor
There was a problem hiding this comment.
Lets remove this, we don't refer to old implementations but the current state of the code.
Suggested change
| Regression: the overlap length was computed with a word/token count while | |
| curr_pos/split_idx_start are character offsets, so it went negative and the | |
| _split_overlap metadata was silently left empty for word/token units. |
Contributor
|
Hey @Osamaali313 thanks for opening the PR! Please address the failing CI issues. |
sjrl
reviewed
Jun 30, 2026
| ) | ||
|
|
||
|
|
||
| def test_run_split_by_dot_and_overlap_1_word_unit_split_overlap_metadata(): |
Contributor
There was a problem hiding this comment.
Lets name the test such that it represents the behavior we are expecting
Suggested change
| def test_run_split_by_dot_and_overlap_1_word_unit_split_overlap_metadata(): | |
| def test_word_unit_split_populates_split_overlap_metadata(): |
sjrl
reviewed
Jun 30, 2026
Comment on lines
+406
to
+410
| # curr_pos and split_idx_start are character offsets, so the overlap and | ||
| # range must be measured in characters too. Using self._chunk_length() | ||
| # here would mix units for word/token splitting (it returns a word/token | ||
| # count), making overlap_length negative and silently dropping the | ||
| # _split_overlap metadata. |
Contributor
There was a problem hiding this comment.
This is a bit wordy. Lets shorten to
Suggested change
| # curr_pos and split_idx_start are character offsets, so the overlap and | |
| # range must be measured in characters too. Using self._chunk_length() | |
| # here would mix units for word/token splitting (it returns a word/token | |
| # count), making overlap_length negative and silently dropping the | |
| # _split_overlap metadata. | |
| # curr_pos and split_idx_start are character offsets, so measure the | |
| # overlap and range in characters too (not via _chunk_length, which returns a word/token count). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Related Issues
None — found while reviewing
RecursiveDocumentSplitteroverlap handling (a follow-up to thesplit_idx_startfamily, #11710/#11711, which fixed a different site).Proposed Changes
_add_overlap_infocomputed the overlap length withself._chunk_length(prev_doc.content), which returns a word/token count forsplit_unit="word"/"token", whilecurr_posandsplit_idx_startare character offsets. Mixing units madeoverlap_lengthnegative, so theif overlap_length > 0guard never fired and the_split_overlapmetadata was silently left empty for word/token splitting with overlap (char units were unaffected, since therelen == _chunk_length).Fix: measure the overlap and ranges in characters via
len(prev_doc.content).How did you test it?
Added
test_run_split_by_dot_and_overlap_1_word_unit_split_overlap_metadata, which fails onmain(empty metadata) and passes after the fix. Fulltest_recursive_splitter.py: 49 passed. Release note added underreleasenotes/notes/.Checklist
Disclaimer: this PR was prepared with AI assistance (Claude). I reviewed the change and verified it RED→GREEN against the real component before submitting.