TIKA-4766: typed Document parse contract for tika-grpc #2921

Open
krickert wants to merge 1 commit into apache:main from ai-pipestream:TIKA-4766-document-contract

krickert commented Jul 1, 2026

Summary

Follow-up to #2916, reshaped per the review there. Instead of mirroring Tika's open metadata taxonomy in protobuf (~5k lines of proto, per-format messages), this PR types the thing that is actually stable: the parsed document. The contract is one small file (document.proto is 208 lines), and format specifics live in per-parser mapping code, never on the wire.

FetchAndParseReply.fields (map<string,string>, field 2, now reserved) is replaced by FetchAndParseReply.document.

How this answers the #2916 review

| Concern from #2916 | Where it landed |
|---|---|
| "11k lines to nail down maybe 80% of an open set" | 208-line contract; whole PR is ~3.9k lines, most of it mapper code + tests |
| Clients rebuild when metadata definitions change | A metadata key is now data (extra tail), not schema; add/rename/retype a Tika Property and no client regenerates anything |
| Lossless catch-all as source of truth | Document.extra carries every Tika key, multivalue-preserving; a test asserts nothing is dropped |
| Special handling only for DC + core props | DocumentMetadata types only the bounded cross-format fields (title, authors, dates as Timestamp, counts, dimensions, rights) |
| Break it into individually reviewable tasks | This PR is the contract only; see "Deliberately not in this PR" |
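
To make the "data, not schema" and "nothing is dropped" rows concrete, here is a minimal sketch of a typed-where-declared tail. It uses plain Java collections as stand-ins rather than the generated protobuf classes; `TypedTailSketch`, `DeclaredType`, and the string fallback on malformed values are assumptions for illustration, not the PR's code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the typed tail; not the PR's generated classes.
final class TypedTailSketch {

    // Assumed names for the declared types listed in the description.
    enum DeclaredType { INTEGER, REAL, BOOLEAN, DATE }

    // Convert one raw value. A key with no declared type (null) stays a String.
    static Object typeValue(DeclaredType declared, String raw) {
        if (declared == null) {
            return raw; // never guessed, even if the value looks numeric
        }
        try {
            switch (declared) {
                case INTEGER: return Long.valueOf(raw);
                case REAL:    return Double.valueOf(raw);
                case BOOLEAN: return Boolean.valueOf(raw);
                default:      return raw; // DATE parsing omitted in this sketch
            }
        } catch (NumberFormatException e) {
            return raw; // assumption: keep a malformed value as a string
        }
    }

    // Build the tail, keeping every key and every value of multi-valued keys.
    static Map<String, List<Object>> buildTail(Map<String, List<String>> raw,
                                               Map<String, DeclaredType> declared) {
        Map<String, List<Object>> tail = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : raw.entrySet()) {
            List<Object> typed = new ArrayList<>();
            for (String v : e.getValue()) {
                typed.add(typeValue(declared.get(e.getKey()), v));
            }
            tail.put(e.getKey(), typed);
        }
        return tail;
    }
}
```

Because an undeclared key is passed through as a string, a new or renamed key changes only the data in the tail, and the key set of the output always equals the key set of the input.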

The shape

  1. Content tree: markdown (the same render ToMarkdownContentHandler already produces since TIKA-4730) plus blocks — that markdown parsed once, format-agnostically, into a structured tree of headings/paragraphs/lists/tables/code blocks/inline runs (CommonMark + GFM, a spec that does not churn). This is what a downstream NLP/RAG/embeddings consumer actually wants: typed tables and sections, not a string to re-parse.
  2. Typed common metadata: DocumentMetadata, grouped by concern, not by source format. Dates are Timestamps, counts are ints — not strings that 12 language clients each re-parse.
  3. Tagged tail: extra — every remaining key, typed only where Tika's own Property declares a type (integer/real/boolean/date), string otherwise, never guessed.
  4. embedded recurses: a PDF with an embedded image is a parent Document with a fully typed child, so two formats are never forced into one bucket (this was the oneof problem from #2916).
  5. Adding a format = adding a DocumentTransformer (see tika-grpc-mapper/docs/EXTENSIONS.md); PdfDocumentTransformer is 65 lines and the wire contract does not move.
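
As a rough illustration of item 1, the sketch below turns a markdown string into a flat list of typed blocks. It handles only ATX headings and paragraphs; the real block tree covers lists, tables, code blocks, and inline runs and would need a full CommonMark + GFM parser. `BlockTreeSketch` and `Block` are hypothetical names, not the messages in document.proto.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified stand-in for the markdown block-tree builder.
final class BlockTreeSketch {

    // One block: kind is "heading" or "paragraph".
    static final class Block {
        final String kind;
        final int level;   // heading level 1-6, 0 for paragraphs
        final String text;

        Block(String kind, int level, String text) {
            this.kind = kind;
            this.level = level;
            this.text = text;
        }
    }

    // Parse markdown once into blocks; consecutive text lines join into one paragraph.
    static List<Block> parse(String markdown) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder para = new StringBuilder();
        for (String line : markdown.split("\n")) {
            String t = line.trim();
            int level = headingLevel(t);
            if (t.isEmpty() || level > 0) {
                flush(para, blocks); // a blank line or a heading ends the paragraph
                if (level > 0) {
                    blocks.add(new Block("heading", level, t.substring(level).trim()));
                }
            } else {
                if (para.length() > 0) {
                    para.append(' ');
                }
                para.append(t);
            }
        }
        flush(para, blocks);
        return blocks;
    }

    // ATX heading: 1-6 '#' characters followed by a space or end of line.
    static int headingLevel(String line) {
        int n = 0;
        while (n < line.length() && line.charAt(n) == '#') {
            n++;
        }
        boolean terminated = n == line.length() || line.charAt(n) == ' ';
        return (n >= 1 && n <= 6 && terminated) ? n : 0;
    }

    private static void flush(StringBuilder para, List<Block> out) {
        if (para.length() > 0) {
            out.add(new Block("paragraph", 0, para.toString()));
            para.setLength(0);
        }
    }
}
```

A consumer can then route on `kind` and `level` instead of re-parsing the rendered string, which is the point of shipping blocks alongside markdown.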

Deliberately not in this PR (follow-ups, each its own PR)

  • Pluggable external parsers: registering a third-party gRPC service whose output rides along on the Document as a google.protobuf.Any — so wildly different result shapes (e.g. a document-layout model's tree) never require Tika to model them. Built and tested on a branch; kept out to keep this reviewable.
  • A Markdown parser for .md input files (separate JIRA).
  • Richer typed fields, if and only if real cross-format demand appears — they'd be additive optional fields, compatible both directions.

Open decisions where reviewer preference wins

  1. Tail shape: repeated MetadataField with a typed value oneof (as implemented) vs the map<string, StringList> suggested in #2916. The typed-where-declared tail preserves types without guessing; the map is maximally churn-proof. Swapping is a one-message change, so I'm happy to go either way.
  2. markdown + blocks both: today both ship (string render + structured tree). If payload size matters, a per-request flag choosing one is easy.
  3. Hard removal vs staged: fields is hard-removed (4.0, nothing consumes it yet); can switch to deprecate-then-remove if preferred.
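
For decision 1, the two tail shapes relate roughly as sketched below: a repeated list of typed fields can always be flattened into a string-list map, at the cost of the declared types. `Field` is a hypothetical stand-in for `MetadataField`, and the keys in the usage are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical comparison of the two tail shapes; not the PR's generated classes.
final class TailShapeSketch {

    // Stand-in for one MetadataField: a key plus one typed value.
    static final class Field {
        final String key;
        final Object value;

        Field(String key, Object value) {
            this.key = key;
            this.value = value;
        }
    }

    // Flatten repeated typed fields into the map<string, StringList> shape.
    // Repeated keys keep their order, so multi-valued keys survive.
    static Map<String, List<String>> toStringListMap(List<Field> fields) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Field f : fields) {
            out.computeIfAbsent(f.key, k -> new ArrayList<>())
               .add(String.valueOf(f.value));
        }
        return out;
    }
}
```

Going the other way would require the declared Property types to be known on the client, which is the trade-off between the two options.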

Client migration

| Before | After |
|---|---|
| fields["X-TIKA:content"] | document.markdown (or walk document.blocks) |
| fields["Content-Type"] | document.content_type |
| Ad hoc title/author/date strings | document.metadata.title / .authors / .created (Timestamp) |
| Any other key | document.extra (typed by declared Property type, string otherwise) |
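
The date row is where the migration changes client code the most. A sketch of the difference, using a plain `Doc` class and `java.time.Instant` as stand-ins for the generated Document and protobuf Timestamp (both are assumptions, as is the choice of `dcterms:created` as the example key):

```java
import java.time.Instant;
import java.util.Map;

// Hypothetical before/after client code; Doc is not the generated Document class.
final class MigrationSketch {

    // Stand-in for the typed reply; the real field would be a protobuf Timestamp.
    static final class Doc {
        String markdown;
        String contentType;
        Instant created;
    }

    // Before: the client looks up a string key and parses the date itself.
    // Instant.parse accepts only ISO-8601 instants such as 2026-07-01T00:00:00Z,
    // so real clients also had to cope with other formats.
    static Instant createdBefore(Map<String, String> fields) {
        String raw = fields.get("dcterms:created");
        return raw == null ? null : Instant.parse(raw);
    }

    // After: the value arrives typed and needs no parsing on the client.
    static Instant createdAfter(Doc doc) {
        return doc.created;
    }
}
```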

Test plan

  • ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test — green (transformer tests against real parse fixtures per format, block-tree tests, DocumentBuilder envelope/status/embedded tests, server tests reading FetchAndParseReply.document)
  • tika-grpc-api jar bundles META-INF/org.apache.tika.grpc.v1.descriptors (verified: contains document.proto)
  • e2e tika-grpc-e2e-test compiles against the new API
  • CI

Downstream context: this contract is what the OpenNLP gRPC work (OPENNLP-1833) will consume as input — Tika parse → typed document → NLP/embeddings without re-parsing strings.

krickert force-pushed the TIKA-4766-document-contract branch 2 times, most recently from c1de042 to 1d2d5bd on July 1, 2026 23:55
…t contract

One small, stable proto instead of a message per source format.

Contract (tika-grpc-api, ~200 lines of proto):
- Document: content as a structured markdown block tree (headings,
  paragraphs, lists, tables, code blocks, inline runs -- CommonMark + GFM,
  the same markdown ToMarkdownContentHandler already renders), plus
  `markdown` as the authoritative rendered form.
- DocumentMetadata: a small bounded set of typed common fields grouped by
  concern, not by source format (title/authors/keywords/languages, created
  and modified as Timestamps, page/word/character counts, dimensions,
  rights).
- Tagged tail: `extra` carries every remaining Tika key losslessly and
  multivalue-preserving, typed only where Tika's own Property declares a
  type (integer/real/boolean/date) and string otherwise -- never guessed.
- `embedded` recurses: a parent PDF and an embedded image are each a fully
  typed Document.
- `format_category` is a cheap routing hint; cross-cutting concerns such
  as Creative Commons rights coexist with it rather than fighting a oneof.
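
A minimal sketch of what the recursive `embedded` shape means for a consumer: one depth-first walk handles a parent and any number of nested children. `Doc` here is a hypothetical stand-in with only a content type and a child list, not the generated Document message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical walk over a recursive document; not the PR's generated classes.
final class EmbeddedWalkSketch {

    static final class Doc {
        final String contentType;
        final List<Doc> embedded = new ArrayList<>();

        Doc(String contentType) {
            this.contentType = contentType;
        }
    }

    // Depth-first listing of content types, indented two spaces per nesting level.
    static List<String> contentTypes(Doc root) {
        List<String> out = new ArrayList<>();
        collect(root, 0, out);
        return out;
    }

    private static void collect(Doc doc, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        out.add(line.append(doc.contentType).toString());
        for (Doc child : doc.embedded) {
            collect(child, depth + 1, out); // each child is a full document itself
        }
    }
}
```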

Format specifics live in per-parser DocumentTransformer code
(tika-grpc-mapper), not in the wire contract: adding or changing a parser
never touches the proto, and metadata churn lands in the mapper and the
tagged tail, so a new or renamed metadata key never forces a client
rebuild.
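
One way to picture this split is a small registry of per-format transformers that all fill the same output field. The interface below is an assumption for illustration; the actual DocumentTransformer in tika-grpc-mapper may have a different shape, and the metadata keys are only examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical registry; the real DocumentTransformer API may differ.
final class TransformerRegistrySketch {

    // Assumed interface: format knowledge lives here, not in the wire contract.
    interface Transformer {
        boolean supports(String contentType);

        String title(Map<String, String> raw);
    }

    // Example transformer that prefers a PDF-specific key over the common one.
    static final class PdfTransformer implements Transformer {
        public boolean supports(String contentType) {
            return "application/pdf".equals(contentType);
        }

        public String title(Map<String, String> raw) {
            String title = raw.get("pdf:docinfo:title");
            return title != null ? title : raw.get("dc:title");
        }
    }

    private final List<Transformer> transformers = new ArrayList<>();

    void register(Transformer transformer) {
        transformers.add(transformer);
    }

    // The first supporting transformer wins; otherwise fall back to the common key.
    String title(String contentType, Map<String, String> raw) {
        for (Transformer t : transformers) {
            if (t.supports(contentType)) {
                return t.title(raw);
            }
        }
        return raw.get("dc:title");
    }
}
```

Supporting another format in this sketch means registering one more `Transformer`; the output stays a single title string either way.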

Server (tika-grpc):
- FetchAndParseReply.document (field 5) replaces the fields map (field 2,
  now reserved). TikaGrpcServerImpl maps parse output via DocumentBuilder.

Modules: tika-grpc-api (proto + generated messages + bundled
FileDescriptorSet), tika-grpc-mapper (DocumentBuilder, per-format
transformers, markdown block-tree builder), both listed in tika-bom.

Tests: transformer tests against real parse fixtures per format, block
tree round-trip tests, DocumentBuilder envelope/status/embedded tests,
server tests updated to read FetchAndParseReply.document.
krickert force-pushed the TIKA-4766-document-contract branch from 1d2d5bd to da51be9 on July 2, 2026 01:41
krickert commented Jul 2, 2026

Follow-up work is broken out into TIKA-4771 (pluggable external parsers) and TIKA-4772 (document event streaming) rather than growing this PR.

krickert commented Jul 3, 2026

@nddipiazza @tballison what do you think of this design instead? Far fewer fields, plus the ability to roll your own protobuf model in the future. Best of both worlds. The Document structure is very markdown-friendly. I'll make the output of this usable as the input of the OpenNLP gRPC server.
