fix: populate _split_overlap metadata for word/token units in RecursiveDocumentSplitter by Osamaali313 · Pull Request #11825 · deepset-ai/haystack

Osamaali313 · 2026-06-29T19:09:51Z

Related Issues

None — found while reviewing RecursiveDocumentSplitter overlap handling (a follow-up to the split_idx_start family, #11710/#11711, which fixed a different site).

Proposed Changes

_add_overlap_info computed the overlap length with self._chunk_length(prev_doc.content), which returns a word/token count for split_unit="word"/"token", while curr_pos and split_idx_start are character offsets. Mixing the two units made overlap_length negative, so the if overlap_length > 0 guard never fired and the _split_overlap metadata was silently left empty for word/token splitting with overlap. Char units were unaffected, because there _chunk_length returns the same value as len.

Fix: measure the overlap and ranges in characters via len(prev_doc.content).

# before (word unit, split_overlap=1): every chunk's _split_overlap == []
# after: overlap provenance is populated for all overlapping chunks
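
To make the unit mismatch concrete, here is a minimal standalone sketch. It does not use the real component: word_count and overlap_length are hypothetical helpers that only mirror the arithmetic described above, and the sample text and offsets are made up for illustration.

```python
def word_count(text: str) -> int:
    """Hypothetical stand-in for a word-unit length function."""
    return len(text.split())


def overlap_length(prev_content: str, prev_start: int, curr_pos: int, measure) -> int:
    """Measured length of the previous chunk minus the distance between chunk starts.

    prev_start and curr_pos are character offsets, so `measure` has to count
    characters for the result to be meaningful.
    """
    return measure(prev_content) - (curr_pos - prev_start)


text = "This is sentence one. This is sentence two."
prev_content = "This is sentence one. "  # previous chunk, starting at character 0
prev_start = 0
curr_pos = text.index("one.")  # current chunk starts one word back (1-word overlap)

# Mixed units: word count minus a character distance goes negative.
mixed_units = overlap_length(prev_content, prev_start, curr_pos, word_count)  # -13

# Consistent units: character count minus a character distance.
char_units = overlap_length(prev_content, prev_start, curr_pos, len)  # 5

# Character range of the overlap inside the previous chunk.
overlap_range = (len(prev_content) - char_units, len(prev_content))  # (17, 22)
overlap_text = prev_content[overlap_range[0]:overlap_range[1]]  # "one. "
```

With the word-based measure the result is negative, so a guard of the form overlap_length > 0 would skip the metadata entirely; with len the overlap covers exactly the shared word and its trailing space.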

How did you test it?

Added test_run_split_by_dot_and_overlap_1_word_unit_split_overlap_metadata, which fails on main (empty metadata) and passes after the fix. Full test_recursive_splitter.py: 49 passed. Release note added under releasenotes/notes/.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

Disclaimer: this PR was prepared with AI assistance (Claude). I reviewed the change and verified it RED→GREEN against the real component before submitting.

…veDocumentSplitter _add_overlap_info computed the overlap length with self._chunk_length(prev_doc.content), which returns a word/token count for word/token split units, while curr_pos and split_idx_start are character offsets. Mixing units made overlap_length negative, so the guard never fired and the _split_overlap metadata was silently left empty for word/token splitting with overlap (char units were unaffected). Measure the overlap and ranges in characters via len(prev_doc.content). Adds a regression test and a release note.

vercel · 2026-06-29T19:09:57Z

@Osamaali313 is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

CLAassistant · 2026-06-29T19:10:09Z

All committers have signed the CLA.

sjrl · 2026-06-30T07:47:37Z

@@ -0,0 +1,9 @@
+---
+fixes:


We use reStructuredText for our reno files, so we should make sure to use double backticks for inline code.

sjrl · 2026-06-30T07:48:42Z

+    Regression: the overlap length was computed with a word/token count while
+    curr_pos/split_idx_start are character offsets, so it went negative and the
+    _split_overlap metadata was silently left empty for word/token units.


Let's remove this. We don't refer to old implementations, only to the current state of the code.

Suggested change

-    Regression: the overlap length was computed with a word/token count while
-    curr_pos/split_idx_start are character offsets, so it went negative and the
-    _split_overlap metadata was silently left empty for word/token units.

sjrl · 2026-06-30T07:50:27Z

Hey @Osamaali313 thanks for opening the PR! Please address the failing CI issues.

sjrl · 2026-06-30T07:55:12Z

        )


+def test_run_split_by_dot_and_overlap_1_word_unit_split_overlap_metadata():


Let's name the test so that it describes the behavior we expect.

Suggested change

-def test_run_split_by_dot_and_overlap_1_word_unit_split_overlap_metadata():
+def test_word_unit_split_populates_split_overlap_metadata():

sjrl · 2026-06-30T07:56:31Z

+        # curr_pos and split_idx_start are character offsets, so the overlap and
+        # range must be measured in characters too. Using self._chunk_length()
+        # here would mix units for word/token splitting (it returns a word/token
+        # count), making overlap_length negative and silently dropping the
+        # _split_overlap metadata.


This is a bit wordy. Let's shorten it to:

Suggested change

-        # curr_pos and split_idx_start are character offsets, so the overlap and
-        # range must be measured in characters too. Using self._chunk_length()
-        # here would mix units for word/token splitting (it returns a word/token
-        # count), making overlap_length negative and silently dropping the
-        # _split_overlap metadata.
+        # curr_pos and split_idx_start are character offsets, so measure the
+        # overlap and range in characters too (not via _chunk_length, which returns a word/token count).

Osamaali313 requested a review from a team as a code owner June 29, 2026 19:09

Osamaali313 requested review from Copilot and sjrl and removed request for a team and Copilot June 29, 2026 19:09

github-actions Bot added the topic:tests label Jun 29, 2026

sjrl self-assigned this Jun 30, 2026

sjrl reviewed Jun 30, 2026

Osamaali313 wants to merge 1 commit into deepset-ai:main from Osamaali313:fix/recursive-splitter-word-token-overlap-metadata


3 participants
