TIKA-4766: typed Document parse contract for tika-grpc by krickert · Pull Request #2921 · apache/tika

krickert · 2026-07-01T22:38:33Z

Summary

Follow-up to #2916, reshaped per the review there. Instead of mirroring Tika's open metadata taxonomy in protobuf (~5k lines of proto, per-format messages), this PR types the thing that is actually stable: the parsed document. There is one small contract (document.proto is 208 lines), and format specifics live in per-parser mapping code, never on the wire.

FetchAndParseReply.fields (map<string,string>, field 2, now reserved) is replaced by FetchAndParseReply.document.

How this answers the #2916 review

| Concern from #2916 | Where it landed |
| --- | --- |
| "11k lines to nail down maybe 80% of an open set" | 208-line contract; whole PR is ~3.9k lines, most of it mapper code + tests |
| Clients rebuild when metadata definitions change | A metadata key is now data (`extra` tail), not schema: add/rename/retype a Tika `Property` and no client regenerates anything |
| Lossless catch-all as source of truth | `Document.extra` carries every Tika key, multivalue-preserving; a test asserts nothing is dropped |
| Special handling only for DC + core props | `DocumentMetadata` types only the bounded cross-format fields (title, authors, dates as `Timestamp`, counts, dimensions, rights) |
| Break it into individually reviewable tasks | This PR is the contract only; see "Deliberately not in this PR" |

The shape

- Content tree: `markdown` (the same render ToMarkdownContentHandler has produced since TIKA-4730) plus `blocks`, which is that markdown parsed once, format-agnostically, into a structured tree of headings/paragraphs/lists/tables/code blocks/inline runs (CommonMark + GFM, a spec that does not churn). This is what a downstream NLP/RAG/embeddings consumer actually wants: typed tables and sections, not a string to re-parse.
- Typed common metadata: `DocumentMetadata`, grouped by concern, not by source format. Dates are `Timestamp`s and counts are ints, not strings that 12 language clients each re-parse.
- Tagged tail: `extra` holds every remaining key, typed only where Tika's own `Property` declares a type (integer/real/boolean/date), string otherwise, never guessed.
- `embedded` recurses: a PDF with an embedded image is a parent `Document` with a fully typed child, so two formats are never forced into one bucket (this was the oneof problem from #2916).
- Adding a format = adding a `DocumentTransformer` (see tika-grpc-mapper/docs/EXTENSIONS.md); `PdfDocumentTransformer` is 65 lines and the wire contract does not move.
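
To make the recursive shape concrete, here is a minimal client-side sketch in Python. It is not the generated protobuf API: the dataclasses are simplified stand-ins, `page_count`, `Block.kind`, `Block.text` and the `walk` helper are assumed names for illustration, and `datetime` stands in for `google.protobuf.Timestamp`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Iterator, Optional


@dataclass
class DocumentMetadata:
    # Bounded cross-format fields only; page_count is an assumed field name.
    title: Optional[str] = None
    authors: list[str] = field(default_factory=list)
    created: Optional[datetime] = None  # stand-in for google.protobuf.Timestamp
    page_count: Optional[int] = None


@dataclass
class Block:
    # Simplified stand-in for one node of the markdown block tree.
    kind: str  # e.g. "heading", "paragraph", "table"
    text: str = ""
    children: list[Block] = field(default_factory=list)


@dataclass
class Document:
    content_type: str
    markdown: str = ""
    blocks: list[Block] = field(default_factory=list)
    metadata: DocumentMetadata = field(default_factory=DocumentMetadata)
    embedded: list[Document] = field(default_factory=list)


def walk(doc: Document, depth: int = 0) -> Iterator[tuple[int, Document]]:
    """Yield (depth, document) for a document and every embedded child, parent first."""
    yield depth, doc
    for child in doc.embedded:
        yield from walk(child, depth + 1)


# A parent PDF with one embedded image: each is its own fully typed Document.
image = Document(content_type="image/png")
pdf = Document(
    content_type="application/pdf",
    markdown="# Report\n\nBody text.",
    blocks=[Block("heading", "Report"), Block("paragraph", "Body text.")],
    metadata=DocumentMetadata(
        title="Report",
        created=datetime(2026, 7, 1, tzinfo=timezone.utc),
    ),
    embedded=[image],
)

for depth, doc in walk(pdf):
    print("  " * depth + doc.content_type)
```

Because the child is a full `Document` rather than a branch of a oneof, the same traversal code handles any nesting depth and any mix of formats.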

Deliberately not in this PR (follow-ups, each its own PR)

- Pluggable external parsers: registering a third-party gRPC service whose output rides along on the `Document` as a `google.protobuf.Any`, so wildly different result shapes (e.g. a document-layout model's tree) never require Tika to model them. Built and tested on a branch; kept out to keep this PR reviewable.
- A Markdown parser for .md input files (separate JIRA).
- Richer typed fields, if and only if real cross-format demand appears. They would be additive optional fields, compatible in both directions.

Open decisions where reviewer preference wins

- Tail shape: `repeated MetadataField` with a typed value oneof (as implemented) vs the `map<string, StringList>` suggested in #2916. The typed-where-declared tail preserves types without guessing; the map is maximally churn-proof. Swapping is a one-message change, and I'm happy to go either way.
- `markdown` + `blocks` both: today both ship (string render + structured tree). If payload size matters, a per-request flag choosing one is easy to add.
- Hard removal vs staged: `fields` is hard-removed (4.0, nothing consumes it yet); this can switch to deprecate-then-remove if preferred.
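
The tail-shape trade-off can be illustrated with a small Python sketch. Everything here is hypothetical: the `DECLARED_TYPES` table stands in for whatever Tika's `Property` objects declare, the `example:*` keys are invented, and `MetadataField` is a plain dataclass rather than the generated message.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Union

Value = Union[int, float, bool, datetime, str]

# Hypothetical declared types; keys absent from this table have no declared type.
DECLARED_TYPES = {
    "example:page-count": "integer",
    "example:created": "date",
}


@dataclass(frozen=True)
class MetadataField:
    key: str
    value: Value


def parse_declared(declared: Optional[str], raw: str) -> Value:
    """Convert only when a type is declared; otherwise keep the string as-is."""
    if declared == "integer":
        return int(raw)
    if declared == "real":
        return float(raw)
    if declared == "boolean":
        return raw.strip().lower() == "true"
    if declared == "date":
        return datetime.fromisoformat(raw)
    return raw  # no declared type: never guess


def typed_tail(raw: dict[str, list[str]]) -> list[MetadataField]:
    """Option A: one entry per value, typed where declared, multivalue-preserving."""
    return [
        MetadataField(key, parse_declared(DECLARED_TYPES.get(key), value))
        for key, values in raw.items()
        for value in values
    ]


def string_map(raw: dict[str, list[str]]) -> dict[str, list[str]]:
    """Option B: every value stays a string, grouped per key."""
    return {key: list(values) for key, values in raw.items()}


raw = {
    "example:page-count": ["12"],
    "example:created": ["2026-07-01T22:38:33+00:00"],
    "example:tag": ["a", "b"],
}
tail = typed_tail(raw)
```

In this sketch both shapes keep every key and every value; they differ only in whether the client or the server turns `"12"` into `12`.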

Client migration

| Before | After |
| --- | --- |
| `fields["X-TIKA:content"]` | `document.markdown` (or walk `document.blocks`) |
| `fields["Content-Type"]` | `document.content_type` |
| Ad hoc title/author/date strings | `document.metadata.title` / `.authors` / `.created` (`Timestamp`) |
| Any other key | `document.extra` (typed by declared `Property` type, string otherwise) |
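
A rough before/after of client code, sketched in Python. The reply objects are `SimpleNamespace` stand-ins rather than generated gRPC classes, the `"created"` map key is a hypothetical example of an ad hoc date string, and `datetime` again stands in for `Timestamp`.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# Before: one flat string map; the client parses dates itself.
old_reply = SimpleNamespace(fields={
    "X-TIKA:content": "# Report\n\nBody text.",
    "Content-Type": "application/pdf",
    "created": "2026-07-01T22:38:33+00:00",  # hypothetical ad hoc key
})

# After: typed fields on a Document.
new_reply = SimpleNamespace(document=SimpleNamespace(
    markdown="# Report\n\nBody text.",
    content_type="application/pdf",
    metadata=SimpleNamespace(
        title="Report",
        authors=["krickert"],
        created=datetime(2026, 7, 1, 22, 38, 33, tzinfo=timezone.utc),
    ),
))


def summarize_old(reply):
    """Old access pattern: string lookups plus client-side date parsing."""
    f = reply.fields
    return {
        "content_type": f["Content-Type"],
        "text": f["X-TIKA:content"],
        "created": datetime.fromisoformat(f["created"]),
    }


def summarize_new(reply):
    """New access pattern: attribute access, no string parsing."""
    doc = reply.document
    return {
        "content_type": doc.content_type,
        "text": doc.markdown,
        "created": doc.metadata.created,
    }
```

Both functions return the same summary for these sample replies; the difference is that the second contains no parsing logic of its own.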

Test plan

- `./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test` is green (transformer tests against real parse fixtures per format, block-tree tests, DocumentBuilder envelope/status/embedded tests, server tests reading `FetchAndParseReply.document`)
- tika-grpc-api jar bundles META-INF/org.apache.tika.grpc.v1.descriptors (verified: contains document.proto)
- e2e: tika-grpc-e2e-test compiles against the new API
- CI

Downstream context: this contract is what the OpenNLP gRPC work (OPENNLP-1833) will consume as input — Tika parse → typed document → NLP/embeddings without re-parsing strings.

…t contract

One small, stable proto instead of a message per source format.

Contract (tika-grpc-api, ~200 lines of proto):
- Document: content as a structured markdown block tree (headings, paragraphs, lists, tables, code blocks, inline runs -- CommonMark + GFM, the same markdown ToMarkdownContentHandler already renders), plus `markdown` as the authoritative rendered form.
- DocumentMetadata: a small bounded set of typed common fields grouped by concern, not by source format (title/authors/keywords/languages, created and modified as Timestamps, page/word/character counts, dimensions, rights).
- Tagged tail: `extra` carries every remaining Tika key losslessly and multivalue-preserving, typed only where Tika's own Property declares a type (integer/real/boolean/date) and string otherwise -- never guessed.
- `embedded` recurses: a parent PDF and an embedded image are each a fully typed Document.
- `format_category` is a cheap routing hint; cross-cutting concerns such as Creative Commons rights coexist with it rather than fighting a oneof.

Format specifics live in per-parser DocumentTransformer code (tika-grpc-mapper), not in the wire contract: adding or changing a parser never touches the proto, and metadata churn lands in the mapper and the tagged tail, so a new or renamed metadata key never forces a client rebuild.

Server (tika-grpc):
- FetchAndParseReply.document (field 5) replaces the fields map (field 2, now reserved). TikaGrpcServerImpl maps parse output via DocumentBuilder.

Modules: tika-grpc-api (proto + generated messages + bundled FileDescriptorSet), tika-grpc-mapper (DocumentBuilder, per-format transformers, markdown block-tree builder), both listed in tika-bom.

Tests: transformer tests against real parse fixtures per format, block tree round-trip tests, DocumentBuilder envelope/status/embedded tests, server tests updated to read FetchAndParseReply.document.

krickert · 2026-07-02T12:24:35Z

Follow-up work is broken out into TIKA-4771 (pluggable external parsers) and TIKA-4772 (document event streaming) rather than growing this PR.

krickert · 2026-07-03T04:38:27Z

@nddipiazza @tballison what do you think of this design instead? Far fewer fields, plus the ability to roll your own protobuf model in the future. Best of both worlds. The Document structure is very markdown-friendly. I'll make the output of this usable as the input to the OpenNLP gRPC server.

krickert force-pushed the TIKA-4766-document-contract branch 2 times, most recently from c1de042 to 1d2d5bd (July 1, 2026 23:55)

krickert mentioned this pull request Jul 2, 2026: TIKA-4770: Add a Markdown parser with structured, lossless XHTML output #2922 (merged)

krickert force-pushed the TIKA-4766-document-contract branch from 1d2d5bd to da51be9 (July 2, 2026 01:41)

krickert wants to merge 1 commit into apache:main from ai-pipestream:TIKA-4766-document-contract
