TIKA-4770: Add a Markdown parser with structured, lossless XHTML output #2922

Merged
tballison merged 3 commits into apache:main from ai-pipestream:markdown-parser on Jul 2, 2026
Conversation

krickert commented Jul 2, 2026

Contributor

Summary

.md files are already detected as text/markdown (globs in tika-mimetypes.xml), but no parser claims the type, so they fall through to TXTParser and come back as flat text — headings, tables, and code fences all collapse into an undifferentiated string.

This adds a MarkdownParser to tika-parser-text-module using commonmark-java — already a Tika dependency (it backs ToMarkdownContentHandler, TIKA-4730) — that parses the markdown AST and emits structured XHTML:

| Markdown | XHTML |
|---|---|
| #..###### / setext | h1..h6 |
| lists (incl. GFM tight lists) | ul / ol (with start when not 1) / li |
| fenced / indented code | pre/code with class="language-x" (+ data-info for any extra fence info) |
| GFM tables | table/thead/tbody/tr/th/td with align |
| emphasis / strong / GFM strikethrough | em / strong / del |
| links, images | a href title, img src alt title |
| block quotes, thematic breaks | blockquote, hr |
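
As a rough illustration of the fence-info mapping, the sketch below splits an info string into a language token and the remainder. `FenceInfoAttributes` and `attributesFor` are hypothetical names, not the parser's code, and it assumes that `data-info` carries the full info string whenever the fence holds more than one token; the PR text does not pin down that detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps a code-fence info string to XHTML attributes. */
class FenceInfoAttributes {

    /**
     * Assumption: the first whitespace-delimited token is the language, and
     * data-info holds the full info string when there is more than one token.
     */
    static Map<String, String> attributesFor(String info) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String trimmed = info == null ? "" : info.trim();
        if (trimmed.isEmpty()) {
            return attrs; // bare fence: no attributes at all
        }
        String[] tokens = trimmed.split("\\s+", 2);
        attrs.put("class", "language-" + tokens[0]);
        if (tokens.length > 1) {
            attrs.put("data-info", trimmed); // keep everything the fence said
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(attributesFor("java"));
        System.out.println(attributesFor("python linenos"));
    }
}
```

Keeping the extra tokens in a separate attribute means a consumer that only understands `class="language-x"` still works, while nothing from the fence line is dropped.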

Fidelity and safety

  • No content loss: every literal the commonmark AST carries reaches the output — including image alt text with code spans, ordered-list start numbers, and full code-fence info strings. Only markdown syntax presentation (bullet/fence/emphasis delimiter characters) is normalized, identical to commonmark's reference HtmlRenderer.
  • Raw HTML in the source is emitted as escaped text — preserved, but never injected into the XHTML stream.
  • Encoding detection via AutoDetectReader, the same idiom as TXTParser; detected charset lands in Content-Type/Content-Encoding.
  • Registered via @TikaComponent, same as the other text-module parsers. No MIME changes needed.
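
To make the raw-HTML point concrete, here is a minimal sketch of what "emitted as escaped text" means for the serialized output. `RawHtmlEscaper` is a hypothetical stand-in; in a SAX pipeline the serializer normally performs this escaping when the literal is passed as character data, so the parser itself would not need a helper like this.

```java
/** Hypothetical illustration of escaping raw HTML so it stays text, not markup. */
class RawHtmlEscaper {

    /** Replaces the XML-significant characters with entity references. */
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The script tag survives as visible text and cannot execute.
        System.out.println(escape("<script>alert(1)</script>"));
    }
}
```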

Because the emitted vocabulary matches what ToMarkdownContentHandler consumes, a markdown document round-trips markdown → XHTML → markdown (there's a test for it).

Relationship to other work

Independent of the gRPC Document-contract PR (#2921) — this is the input direction (.md files into Tika); that PR is the output direction. They share only the commonmark library.

Test plan

  • MarkdownParserTest: 9 tests — structure, GFM tables with alignment sections, raw-HTML escaping, ordered-list start numbers, code-span alt text, fence info preservation, charset detection with non-ASCII content, markdown round-trip
  • full tika-parser-text-module test suite green (no regressions in TXT/CSV parsers)
  • apache-rat:check green
  • CI

…output

Markdown files (text/markdown, already detected by glob in tika-mimetypes)
previously fell through to TXTParser and came back as flat text. This adds
a dedicated MarkdownParser using commonmark-java (the library already behind
ToMarkdownContentHandler) that emits structured XHTML: h1-h6, ul/ol/li,
blockquote, pre/code, GFM tables as table/thead/tbody/tr/th/td with
alignment, em/strong/del, links, images, and hr.

Fidelity: every piece of content the commonmark AST carries is preserved --
text/code literals (raw HTML as escaped text, so nothing can be injected),
link and image destinations and titles, image alt text including code spans,
heading levels, table cell alignment and header cells, ordered-list start
numbers (<ol start=...>), the full code-fence info string (class="language-x"
plus data-info when the fence carries more than a language token), and
hard/soft line breaks. Only markdown syntax presentation (bullet/fence/
emphasis delimiter characters, ATX vs setext headings) is normalized, the
same normalization as commonmark's reference HtmlRenderer.

Encoding is detected via AutoDetectReader, matching TXTParser; the detected
charset lands in Content-Type and Content-Encoding.
@tballison

Contributor

LGTM. Some requests:

  • Please tersify comments
  • Try to use the existing RuntimeSAXException
  • Use TikaTest instead of handrolling toXhtml (please use the AutoDetectParser not the specific new parser...may need to pass in dummy file name to get detection to work?)

@tballison

Contributor

And, if your agent has time, may as well add handling for embedded script and data uris like we have in the html parser. We may want to move DataURIScheme/DataURISchemeUtil into tika-core or into a shared *-commons module, probably tika-core?

@krickert

krickert commented Jul 2, 2026

Contributor Author

Agent pushed back rather dramatically. But humans aren't as sensitive, so I overrode the pushback. It'll be heavily reviewed - I'll reply when it's ready.

- Use the existing RuntimeSAXException instead of a bespoke wrapper.
- Tersify comments.
- Rewrite MarkdownParserTest on TikaTest, parsing through AutoDetectParser
  (dummy .md resource name for glob detection) with fixture files under
  test-documents; this also exercises component registration and routing.
- Extract data: URIs as embedded documents, as the html module does:
  image/link destinations are parsed directly, and raw HTML blocks/inline
  (e.g. script tags) are scraped with DataURISchemeUtil.extract. INLINE
  embedded resource type, gated by EmbeddedDocumentExtractor.
- Move DataURIScheme/DataURISchemeUtil/DataURISchemeParseException (and
  their test) from tika-parser-html-module to org.apache.tika.utils in
  tika-core so both parsers share them. tika-core has no commons-codec,
  so the base64 decode now uses java.util.Base64.getMimeDecoder(), which
  is equally lenient about whitespace/non-alphabet characters; truly
  malformed base64 now throws DataURISchemeParseException from parse()
  and is skipped by extract(), where commons-codec silently best-effort
  decoded.
@krickert

krickert commented Jul 2, 2026

Contributor Author

All four done in 175b9e6:

  • Comments — tersified. Class javadoc is down to four lines.
  • RuntimeSAXException — swapped in for the bespoke wrapper; visitor throws it, parse() unwraps, same idiom as the MP4/ASM parsers.
  • Tests — rewritten on TikaTest, and every parse now goes through AUTO_DETECT_PARSER with a dummy .md resource name (glob-only detection, no magic for markdown). Fixtures live under test-documents/. Nice side effect: the tests now exercise component registration and MIME routing end to end, not just the parser class.
  • Data URIs / embedded scripts — (agent) took you up on the move: DataURIScheme, DataURISchemeUtil, and DataURISchemeParseException (plus their test) are now in org.apache.tika.utils in tika-core, and the html module imports them from there. MarkdownParser mirrors HtmlHandler: data: image/link destinations are parsed directly, raw HTML blocks/inline (script tags included) are scraped with extract(), and results flow through EmbeddedDocumentExtractor as INLINE embedded docs. There's a recursive-metadata test showing a markdown file with a data-URI image + a script-embedded data URI yielding three documents.
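
For readers unfamiliar with the format, the sketch below parses a `data:` URI, assuming the RFC 2397 layout `data:[<mediatype>][;base64],<data>`. `DataUriSketch` is a hypothetical stand-in and not the `DataURIScheme`/`DataURISchemeUtil` code discussed here; it uses the JDK decoder only to stay dependency-free.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;

/** Hypothetical data: URI parser, assuming the RFC 2397 layout. */
class DataUriSketch {
    final String mediaType;
    final byte[] data;

    DataUriSketch(String mediaType, byte[] data) {
        this.mediaType = mediaType;
        this.data = data;
    }

    static DataUriSketch parse(String uri) {
        if (!uri.regionMatches(true, 0, "data:", 0, 5)) {
            throw new IllegalArgumentException("not a data: URI");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing ',' before payload");
        }
        String header = uri.substring(5, comma);
        String payload = uri.substring(comma + 1);
        boolean base64 = header.toLowerCase(Locale.ROOT).endsWith(";base64");
        if (base64) {
            header = header.substring(0, header.length() - ";base64".length());
        }
        // RFC 2397 default when the media type is omitted.
        String mediaType = header.isEmpty() ? "text/plain;charset=US-ASCII" : header;
        byte[] bytes = base64
                ? Base64.getMimeDecoder().decode(payload) // throws on malformed input
                : percentDecode(payload);
        return new DataUriSketch(mediaType, bytes);
    }

    /** Decodes %XX escapes; anything else is copied through as UTF-8 bytes. */
    static byte[] percentDecode(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit(in[i + 1], 16);
                int lo = Character.digit(in[i + 2], 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            out.write(in[i]);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        DataUriSketch d = parse("data:text/plain;base64,SGVsbG8=");
        System.out.println(d.mediaType + " -> " + new String(d.data, StandardCharsets.US_ASCII));
    }
}
```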

One behavior change I'd like you to weigh in on:

tika-core doesn't have commons-codec, so the base64 decode now uses java.util.Base64.getMimeDecoder(). Same leniency for whitespace/newlines/backslash-continuations (existing tests pass unchanged), but truly malformed base64 now throws DataURISchemeParseException from parse() (both callers already catch it) and gets skipped by extract() - where commons-codec used to silently best-effort decode.
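
A small sketch of the JDK behavior described above, using only `java.util.Base64`. `MimeDecoderDemo` and `decodeOrNull` are hypothetical names for illustration: the MIME decoder skips line breaks inside the payload, but a payload truncated to a single dangling character is rejected with an `IllegalArgumentException` rather than partially decoded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical demo of java.util.Base64.getMimeDecoder() leniency and strictness. */
class MimeDecoderDemo {

    /** Returns the decoded ASCII text, or null when the JDK rejects the input. */
    static String decodeOrNull(String encoded) {
        try {
            byte[] bytes = Base64.getMimeDecoder().decode(encoded);
            return new String(bytes, StandardCharsets.US_ASCII);
        } catch (IllegalArgumentException e) {
            return null; // the JDK gives up on the whole payload
        }
    }

    public static void main(String[] args) {
        // Line breaks inside the payload are ignored by the MIME decoder.
        System.out.println(decodeOrNull("SGVs\r\nbG8="));
        // Truncated after "SGVs": the lone trailing 'b' cannot form a byte.
        System.out.println(decodeOrNull("SGVsb"));
    }
}
```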

I think failing loudly there is the way to go. Do you agree?

CI should pass; I verified the tests locally: tika-core 4/4 on the moved test, html module 59/0, text module green, checkstyle + RAT clean on all three modules.

@tballison

tballison commented Jul 2, 2026

Contributor

Ah, right, commons-codec...

My memory is that commons-codec is more robust against noisy data than the jdk. Sometimes, we could get some bytes out before jdk would throw.

My claude just spent 4 tries arguing for commons-codec and then jdk and then commons-codec again.

On this try, claude agreed with my confirmation bias.

Commons-codec is never strictly worse for extraction and is sometimes much better; the JDK is never better and is sometimes much worse.

So, let's move this util to a small -commons module in tika-parsers-standard and rely on commons-codec there.

@tballison

Contributor

> Agent pushed back rather dramatically

Try a different model?

What were the concerns?

Per review: extraction wants commons-codec's lenient base64 (salvage bytes
from noisy data) rather than the JDK decoder's strictness, and tika-core
has no commons-codec. So the DataURIScheme classes move out of tika-core
into a new tika-parser-datauri-commons module (same pattern as
digest/jdbc/mail/xmp/zip-commons), package org.apache.tika.parser.datauri,
with the original commons-codec decode restored verbatim. The html and
text modules depend on it (replacing their direct commons-codec deps,
which are otherwise unused); listed in tika-bom.
@krickert

krickert commented Jul 2, 2026

Contributor Author

> Agent pushed back rather dramatically. Try a different model?
>
> What were the concerns?

It wasn't dramatic, just minor, and probably because of me.

What changed

First:

The requested changes are done in 57bf60c.

tika-parser-datauri-commons has the same shape as the digest/zip/mail-commons siblings.

Package org.apache.tika.parser.datauri, commons-codec decode restored verbatim. html and text modules depend on it (their direct commons-codec deps were otherwise unused, so those are gone); it's in tika-bom. Transitive into the standard package via both consumers, same as zip-commons.

The concern

My take is nearly always to err on the side of "fail fast". Claude probably caught onto this pattern.

AI response:

The strict-decode case was: silently best-effort-decoding malformed base64 hands downstream consumers garbage bytes labeled with a confident content-type, with no signal anything was wrong — in a validation context that's a bug factory. The counter (yours) is that Tika isn't a validation context: it's salvage. A truncated data URI in a crawled page still has a decodable prefix, and for extraction, partial bytes beat a clean exception every time. Once you frame it as "who is the consumer of the failure," extraction wins and commons-codec is the right tool. I'd still want strictness if this util ever guards an ingest boundary, but that's not what Tika is for — conceded.
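
The "decodable prefix" idea can be sketched with the JDK alone. This is an illustration of the salvage approach under stated assumptions, not commons-codec's actual algorithm: `Base64Salvage` is a hypothetical helper that drops non-alphabet characters, stops at the first `=`, and discards a lone trailing character that cannot form a byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical salvage decoder: recover whatever prefix of noisy base64 is decodable. */
class Base64Salvage {

    static byte[] salvage(String noisy) {
        StringBuilder clean = new StringBuilder(noisy.length());
        for (int i = 0; i < noisy.length(); i++) {
            char c = noisy.charAt(i);
            if (c == '=') {
                break; // padding marks the end of the data
            }
            if (isBase64Char(c)) {
                clean.append(c); // whitespace and other noise are skipped
            }
        }
        if (clean.length() % 4 == 1) {
            // A single leftover character carries only 6 bits: not a full byte.
            clean.setLength(clean.length() - 1);
        }
        // The basic JDK decoder does not require trailing padding.
        return Base64.getDecoder().decode(clean.toString());
    }

    static boolean isBase64Char(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '+' || c == '/';
    }

    public static void main(String[] args) {
        // "SGVsbG8=" is "Hello"; truncated to "SGVsb" the first three bytes survive.
        System.out.println(new String(salvage("SGVsb"), StandardCharsets.US_ASCII));
    }
}
```

The trade-off is the one debated in the thread: the caller gets bytes back with no signal that the input was damaged, which suits extraction but would be the wrong default for validation.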

My opinion - go either way. I'd vote strict because I've worked with a lot of bad URLs and none of the parsers are 100% right. There are always some strange URLs, and some have even given me security headaches in the past. But that can be solved with my own validator - so not a blocker.

@tballison

Contributor

"fail fast" makes sense in most circumstances. However, for parsing, my personal preference is to get as much as we possibly (reliably) can out of files. If we're able to get anything useful out of a byte[] even if truncated, we should try.

@krickert

krickert commented Jul 2, 2026

Contributor Author

On that note - anything else needed for a merge?

@tballison tballison merged commit aca20dc into apache:main Jul 2, 2026
5 checks passed