TIKA-4766: typed Document parse contract for tika-grpc #2921
Open
krickert wants to merge 1 commit into
…t contract

One small, stable proto instead of a message per source format.

Contract (tika-grpc-api, ~200 lines of proto):
- Document: content as a structured markdown block tree (headings, paragraphs, lists, tables, code blocks, inline runs -- CommonMark + GFM, the same markdown ToMarkdownContentHandler already renders), plus `markdown` as the authoritative rendered form.
- DocumentMetadata: a small bounded set of typed common fields grouped by concern, not by source format (title/authors/keywords/languages, created and modified as Timestamps, page/word/character counts, dimensions, rights).
- Tagged tail: `extra` carries every remaining Tika key losslessly and multivalue-preserving, typed only where Tika's own Property declares a type (integer/real/boolean/date) and string otherwise -- never guessed.
- `embedded` recurses: a parent PDF and an embedded image are each a fully typed Document.
- `format_category` is a cheap routing hint; cross-cutting concerns such as Creative Commons rights coexist with it rather than fighting a oneof.

Format specifics live in per-parser DocumentTransformer code (tika-grpc-mapper), not in the wire contract: adding or changing a parser never touches the proto, and metadata churn lands in the mapper and the tagged tail, so a new or renamed metadata key never forces a client rebuild.

Server (tika-grpc):
- FetchAndParseReply.document (field 5) replaces the fields map (field 2, now reserved). TikaGrpcServerImpl maps parse output via DocumentBuilder.

Modules: tika-grpc-api (proto + generated messages + bundled FileDescriptorSet), tika-grpc-mapper (DocumentBuilder, per-format transformers, markdown block-tree builder), both listed in tika-bom.

Tests: transformer tests against real parse fixtures per format, block tree round-trip tests, DocumentBuilder envelope/status/embedded tests, server tests updated to read FetchAndParseReply.document.
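The "typed only where declared, string otherwise, never guessed" rule for the tagged tail could look roughly like the sketch below. This is an illustration under stated assumptions, not the mapper's code: `TailTypingSketch`, `TailField`, `Kind`, the `DECLARED` lookup and the `example:` keys are all hypothetical, and the fallback to strings when a declared type fails to parse is an assumption made here to keep the tail lossless.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the tagged tail; none of these names come from the PR's proto.
class TailTypingSketch {

    // The four declared types named in the commit message, plus the string fallback.
    enum Kind { INTEGER, REAL, BOOLEAN, DATE, STRING }

    // One tail entry: the key, the kind it was typed as, and all of its values.
    record TailField(String key, Kind kind, List<Object> values) {}

    // Assumed lookup of declared types; in Tika this would come from Property definitions.
    static final Map<String, Kind> DECLARED = Map.of(
            "example:page-count", Kind.INTEGER,
            "example:encrypted", Kind.BOOLEAN,
            "example:created", Kind.DATE);

    // Types a key only when a type is declared for it; undeclared keys stay strings.
    static TailField toTailField(String key, List<String> rawValues) {
        Kind kind = DECLARED.getOrDefault(key, Kind.STRING);
        List<Object> typed = new ArrayList<>();
        try {
            for (String raw : rawValues) {
                typed.add(convert(kind, raw));
            }
        } catch (IllegalArgumentException | DateTimeParseException e) {
            // Assumption: a declared type that does not parse keeps the raw strings
            // instead of dropping or guessing a value.
            return new TailField(key, Kind.STRING, List.copyOf(rawValues));
        }
        return new TailField(key, kind, List.copyOf(typed));
    }

    static Object convert(Kind kind, String raw) {
        switch (kind) {
            case INTEGER:
                return Long.parseLong(raw);
            case REAL:
                return Double.parseDouble(raw);
            case BOOLEAN:
                if (raw.equals("true") || raw.equals("false")) {
                    return Boolean.parseBoolean(raw);
                }
                throw new IllegalArgumentException("not a boolean: " + raw);
            case DATE:
                return Instant.parse(raw);
            default:
                return raw;
        }
    }

    public static void main(String[] args) {
        System.out.println(toTailField("example:page-count", List.of("3", "12")));
        System.out.println(toTailField("custom:flag", List.of("true")));
    }
}
```

In this sketch `custom:flag` stays a string even though its value looks boolean, because nothing declares a type for it; that is the "never guessed" half of the rule.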
@nddipiazza @tballison what do you think of this design instead? Far fewer fields, plus the ability to roll your own protobuf model in the future. Best of both worlds. The document structure is very markdown-friendly. I'll make the output of this usable as the input to the OpenNLP gRPC server.
Summary
Follow-up to #2916, reshaped per the review there. Instead of mirroring Tika's open metadata taxonomy in protobuf (~5k lines of proto, per-format messages), this PR types the thing that is actually stable: the parsed document. One small contract (`document.proto` is 208 lines), and format specifics live in per-parser mapping code, never in the wire.

`FetchAndParseReply.fields` (`map<string,string>`, field 2, now reserved) is replaced by `FetchAndParseReply.document`.

How this answers the #2916 review
- … `extra` tail), not schema: add/rename/retype a Tika `Property` and no client regenerates anything
- `Document.extra` carries every Tika key, multivalue-preserving; a test asserts nothing is dropped
- `DocumentMetadata` types only the bounded cross-format fields (title, authors, dates as `Timestamp`, counts, dimensions, rights)

The shape
- `markdown` (the same render `ToMarkdownContentHandler` already produces since TIKA-4730) plus `blocks`: that markdown parsed once, format-agnostically, into a structured tree of headings/paragraphs/lists/tables/code blocks/inline runs (CommonMark + GFM, a spec that does not churn). This is what a downstream NLP/RAG/embeddings consumer actually wants: typed tables and sections, not a string to re-parse.
- `DocumentMetadata`, grouped by concern, not by source format. Dates are `Timestamp`s and counts are ints, not strings that 12 language clients each re-parse.
- `extra`: every remaining key, typed only where Tika's own `Property` declares a type (integer/real/boolean/date), string otherwise, never guessed.
- `embedded` recurses: a PDF with an embedded image is a parent `Document` with a fully typed child, so two formats are never forced into one bucket (this was the oneof problem from #2916).
- … `DocumentTransformer` (see `tika-grpc-mapper/docs/EXTENSIONS.md`); `PdfDocumentTransformer` is 65 lines and the wire contract does not move.

Deliberately not in this PR (follow-ups, each its own PR)
- … `Document` as a `google.protobuf.Any`, so wildly different result shapes (e.g. a document-layout model's tree) never require Tika to model them. Built and tested on a branch; kept out to keep this reviewable.
- … `.md` input files (separate JIRA).
- … `optional` fields, compatible both directions.

Open decisions where reviewer preference wins
- `repeated MetadataField` with a typed value oneof (as implemented) vs the `map<string, StringList>` suggested in #2916. The typed-where-declared tail preserves types without guessing; the map is maximally churn-proof. Swapping is a one-message change; happy to go either way.
- `markdown` + `blocks` both: today both ship (string render + structured tree). If payload size matters, a per-request flag choosing one is easy.
- `fields` is hard-removed (4.0, nothing consumes it yet); can switch to deprecate-then-remove if preferred.

Client migration
- `fields["X-TIKA:content"]` → `document.markdown` (or walk `document.blocks`)
- `fields["Content-Type"]` → `document.content_type`
- … → `document.metadata.title` / `.authors` / `.created` (`Timestamp`)
- … → `document.extra` (typed by declared `Property` type, string otherwise)

Test plan
- `./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test`: green (transformer tests against real parse fixtures per format, block-tree tests, `DocumentBuilder` envelope/status/embedded tests, server tests reading `FetchAndParseReply.document`)
- `tika-grpc-api` jar bundles `META-INF/org.apache.tika.grpc.v1.descriptors` (verified: contains `document.proto`)
- `tika-grpc-e2e-test` compiles against the new API

Downstream context: this contract is what the OpenNLP gRPC work (OPENNLP-1833) will consume as input (Tika parse → typed document → NLP/embeddings, without re-parsing strings).
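For a sense of what "walk `document.blocks`" and the recursive `embedded` field mean for a consumer, here is a minimal sketch that collects heading text from a parent document and its embedded children. The `Block`, `Document` and `BlockType` records are simplified, hypothetical stand-ins written for this example; the real message and field shapes are whatever `document.proto` defines, and a real client would use the generated classes instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the generated messages; the real block
// and document shapes are defined by document.proto and may differ.
class BlockWalkSketch {

    enum BlockType { HEADING, PARAGRAPH, LIST, TABLE, CODE_BLOCK }

    // A block with optional text and nested child blocks (e.g. list items).
    record Block(BlockType type, String text, List<Block> children) {}

    // A parsed document: its own block tree plus fully typed embedded documents.
    record Document(String contentType, List<Block> blocks, List<Document> embedded) {}

    // Collects heading text from the document, then depth-first from each embedded document.
    static List<String> headings(Document doc) {
        List<String> out = new ArrayList<>();
        collect(doc, out);
        return out;
    }

    private static void collect(Document doc, List<String> out) {
        for (Block block : doc.blocks()) {
            collectBlock(block, out);
        }
        for (Document child : doc.embedded()) {
            collect(child, out);
        }
    }

    private static void collectBlock(Block block, List<String> out) {
        if (block.type() == BlockType.HEADING) {
            out.add(block.text());
        }
        for (Block child : block.children()) {
            collectBlock(child, out);
        }
    }

    public static void main(String[] args) {
        Document image = new Document("image/png",
                List.of(new Block(BlockType.HEADING, "Figure text", List.of())),
                List.of());
        Document pdf = new Document("application/pdf",
                List.of(new Block(BlockType.HEADING, "Intro", List.of()),
                        new Block(BlockType.PARAGRAPH, "Body", List.of())),
                List.of(image));
        System.out.println(headings(pdf)); // prints [Intro, Figure text]
    }
}
```

The point of the sketch is the migration direction: a client that previously searched the `X-TIKA:content` string for headings or tables can instead filter the tree by block type, and treats an embedded image the same way as its parent PDF.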