fix: respect ByteStream encoding in HTMLToDocument and MarkdownToDocument by greymoth-jp · Pull Request #11792 · deepset-ai/haystack

greymoth-jp · 2026-06-26T16:55:57Z

Summary

HTMLToDocument and MarkdownToDocument both hardcode .decode("utf-8"), which raises UnicodeDecodeError (or silently corrupts output) for any non-UTF-8 ByteStream whose encoding is passed via ByteStream.meta["encoding"]. TextFileToDocument already handles this correctly via an encoding constructor param and a bytestream.meta.get("encoding", self.encoding) fallback.

This PR brings both converters into line with TextFileToDocument:

Add an encoding: str = "utf-8" parameter to HTMLToDocument.__init__ and MarkdownToDocument.__init__
Replace the hardcoded .decode("utf-8") calls with .decode(bytestream.meta.get("encoding", self.encoding)) in the run() method of each converter
Include encoding in HTMLToDocument.to_dict() so the param survives serialization round-trips (.from_dict() already routes through default_from_dict, which handles it automatically)
Add unit tests (test_bytestream_encoding_from_meta and test_bytestream_encoding_from_init) for both converters using a latin-1 ByteStream
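
The fallback described in the list above can be sketched without Haystack installed. This is only an illustration of the lookup order (per-source meta first, constructor default second); `FakeByteStream` and `SketchConverter` are hypothetical stand-ins, not the classes changed in this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeByteStream:
    """Hypothetical stand-in for a ByteStream: raw bytes plus a meta dict."""
    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


class SketchConverter:
    """Hypothetical converter that only illustrates the encoding fallback."""

    def __init__(self, encoding: str = "utf-8"):
        self.encoding = encoding

    def run(self, sources: List[FakeByteStream]) -> List[str]:
        texts = []
        for bytestream in sources:
            # meta["encoding"] wins; otherwise use the constructor default.
            encoding = bytestream.meta.get("encoding", self.encoding)
            texts.append(bytestream.data.decode(encoding))
        return texts


# Encoding supplied per source via meta.
latin1 = FakeByteStream(b"<p>caf\xe9</p>", meta={"encoding": "latin-1"})
from_meta = SketchConverter().run([latin1])

# Encoding supplied once via the constructor, no meta on the source.
from_init = SketchConverter(encoding="latin-1").run([FakeByteStream(b"<p>caf\xe9</p>")])
```

Both calls decode the same latin-1 bytes, which corresponds to the two cases the new unit tests are named after.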

Reproduction (before fix):

from haystack.components.converters import HTMLToDocument
from haystack.dataclasses import ByteStream

bs = ByteStream(b"<p>caf\xe9</p>", meta={"encoding": "latin-1"})
HTMLToDocument().run(sources=[bs])
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9
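
The same failure can be seen with the standard library alone, which may help when checking the behaviour without Haystack. The second half shows one way a wrong codec can corrupt text without raising; it is an illustrative case chosen here, not one taken from the PR.

```python
# latin-1 encodes "é" as the single byte 0xE9, which is not valid UTF-8 on its own.
raw = "café".encode("latin-1")

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the encoding the bytes were actually written in works.
decoded = raw.decode("latin-1")

# Silent corruption: UTF-16-LE bytes for ASCII text are valid UTF-8,
# so decoding succeeds but leaves NUL characters between the letters.
utf16_bytes = "hi".encode("utf-16-le")
garbled = utf16_bytes.decode("utf-8")
```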

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests (29 passed, 0 failed).

Test plan

pytest test/components/converters/test_html_to_document.py test/components/converters/test_markdown_to_document.py -k "not integration" — 29 passed
New tests test_bytestream_encoding_from_meta and test_bytestream_encoding_from_init added for both HTMLToDocument and MarkdownToDocument
Existing test_serde updated to assert encoding round-trips through to_dict / from_dict
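
As a rough picture of what an encoding round-trip assertion checks, here is a minimal sketch using a hypothetical `SketchConverter`. The dict layout with `type` and `init_parameters` keys is an assumption made for illustration; the actual test_serde in the PR may differ.

```python
from typing import Any, Dict


class SketchConverter:
    """Hypothetical component with a single serializable init parameter."""

    def __init__(self, encoding: str = "utf-8"):
        self.encoding = encoding

    def to_dict(self) -> Dict[str, Any]:
        # If encoding were omitted here, from_dict would fall back to "utf-8".
        return {
            "type": "SketchConverter",
            "init_parameters": {"encoding": self.encoding},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchConverter":
        return cls(**data["init_parameters"])


original = SketchConverter(encoding="latin-1")
serialized = original.to_dict()
restored = SketchConverter.from_dict(serialized)
```

The point of such a test is that a non-default value survives: checking only the default "utf-8" would pass even if to_dict() dropped the parameter.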

…ment

Both converters hardcoded `.decode("utf-8")`, silently corrupting or raising UnicodeDecodeError for non-UTF-8 sources that supply an encoding via ByteStream.meta["encoding"].

This mirrors the existing pattern in TextFileToDocument:
- Add `encoding` param to `__init__` (default "utf-8") in both converters
- Use `bytestream.meta.get("encoding", self.encoding)` at decode time
- Include `encoding` in HTMLToDocument.to_dict() / from_dict() round-trip
- Add unit tests for latin-1 ByteStreams via meta["encoding"] and via __init__

vercel · 2026-06-26T16:56:03Z

@greymoth-jp is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

CLAassistant · 2026-06-26T16:56:06Z

All committers have signed the CLA.

greymoth-jp requested a review from a team as a code owner June 26, 2026 16:55

greymoth-jp requested review from anakin87 and removed request for a team June 26, 2026 16:55

github-actions Bot added the topic:tests label Jun 26, 2026

greymoth-jp wants to merge 1 commit into deepset-ai:main from greymoth-jp:greymoth-jp/fix/html-md-converter-encoding

2 participants
