fix: respect ByteStream encoding in HTMLToDocument and MarkdownToDocument #11792

Open
greymoth-jp wants to merge 1 commit into deepset-ai:main from greymoth-jp:greymoth-jp/fix/html-md-converter-encoding

@greymoth-jp


Summary

HTMLToDocument and MarkdownToDocument both hardcode .decode("utf-8"), which raises UnicodeDecodeError (or silently corrupts output) for any non-UTF-8 ByteStream whose encoding is passed via ByteStream.meta["encoding"]. TextFileToDocument already handles this correctly via an encoding constructor param and a bytestream.meta.get("encoding", self.encoding) fallback.

This PR brings both converters into line with TextFileToDocument:

  • Add an encoding: str = "utf-8" parameter to HTMLToDocument.__init__ and MarkdownToDocument.__init__
  • Replace the hardcoded "utf-8" argument in the .decode(...) calls with bytestream.meta.get("encoding", self.encoding) in the run() method of each converter
  • Include encoding in HTMLToDocument.to_dict() so the param survives serialization round-trips (.from_dict() already routes through default_from_dict which handles it automatically)
  • Add unit tests (test_bytestream_encoding_from_meta and test_bytestream_encoding_from_init) for both converters using a latin-1 ByteStream
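The lookup order described in the list above can be sketched without Haystack installed. `FakeByteStream` and `SketchConverter` are hypothetical stand-ins, not the classes changed in this PR; only the `meta.get("encoding", self.encoding)` expression is taken from the description.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeByteStream:
    """Hypothetical stand-in for haystack.dataclasses.ByteStream."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


class SketchConverter:
    """Hypothetical converter that shows only the encoding lookup, no real parsing."""

    def __init__(self, encoding: str = "utf-8"):
        self.encoding = encoding

    def run(self, sources: List[FakeByteStream]) -> List[str]:
        texts = []
        for bytestream in sources:
            # Per-source meta wins; the constructor value is the fallback.
            encoding = bytestream.meta.get("encoding", self.encoding)
            texts.append(bytestream.data.decode(encoding))
        return texts


latin1 = "<p>café</p>".encode("latin-1")

# Encoding supplied per source via meta.
from_meta = SketchConverter().run([FakeByteStream(latin1, meta={"encoding": "latin-1"})])

# Encoding supplied once via the constructor.
from_init = SketchConverter(encoding="latin-1").run([FakeByteStream(latin1)])
```

Both calls decode the same bytes to the same text, which is the behaviour the two new test names suggest.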

Reproduction (before fix):

from haystack.components.converters import HTMLToDocument
from haystack.dataclasses import ByteStream

bs = ByteStream(b"<p>caf\xe9</p>", meta={"encoding": "latin-1"})
HTMLToDocument().run(sources=[bs])
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9
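The underlying failure is plain Python behaviour and can be checked without Haystack. This is a small sketch using only `bytes.decode`; the variable names are illustrative.

```python
# 0xE9 is "é" in latin-1. In UTF-8 it opens a multi-byte sequence,
# and the following "<" is not a valid continuation byte.
raw = b"<p>caf\xe9</p>"

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the encoding the source was actually written in works.
decoded = raw.decode("latin-1")
```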

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests (29 passed, 0 failed).

Test plan

  • pytest test/components/converters/test_html_to_document.py test/components/converters/test_markdown_to_document.py -k "not integration" — 29 passed
  • New tests test_bytestream_encoding_from_meta and test_bytestream_encoding_from_init added for both HTMLToDocument and MarkdownToDocument
  • Existing test_serde updated to assert encoding round-trips through to_dict / from_dict
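As a rough illustration of why the serialization assertion matters, the sketch below round-trips a constructor parameter through `to_dict` / `from_dict`. `SerdeSketch` and its dictionary layout are assumptions made for this example; the real converters rely on Haystack's own serialization helpers.

```python
from typing import Any, Dict


class SerdeSketch:
    """Hypothetical stand-in for a converter with an `encoding` init parameter."""

    def __init__(self, encoding: str = "utf-8"):
        self.encoding = encoding

    def to_dict(self) -> Dict[str, Any]:
        # Illustrative layout only; not necessarily Haystack's exact format.
        return {"init_parameters": {"encoding": self.encoding}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SerdeSketch":
        return cls(**data["init_parameters"])


def test_serde_round_trip() -> None:
    original = SerdeSketch(encoding="latin-1")
    restored = SerdeSketch.from_dict(original.to_dict())
    # If to_dict() omitted `encoding`, the restored object would
    # fall back to the "utf-8" default and this assertion would fail.
    assert restored.encoding == "latin-1"


test_serde_round_trip()
```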

…ment

Both converters hardcoded `.decode("utf-8")`, silently corrupting or
raising UnicodeDecodeError for non-UTF-8 sources that supply an encoding
via ByteStream.meta["encoding"]. This mirrors the existing pattern in
TextFileToDocument:

- Add `encoding` param to `__init__` (default "utf-8") in both converters
- Use `bytestream.meta.get("encoding", self.encoding)` at decode time
- Include `encoding` in HTMLToDocument.to_dict() / from_dict() round-trip
- Add unit tests for latin-1 ByteStreams via meta["encoding"] and via __init__
@greymoth-jp greymoth-jp requested a review from a team as a code owner June 26, 2026 16:55
@greymoth-jp greymoth-jp requested review from anakin87 and removed request for a team June 26, 2026 16:55
@vercel

vercel Bot commented Jun 26, 2026


@greymoth-jp is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLAassistant commented Jun 26, 2026


CLA assistant check
All committers have signed the CLA.
