Pull request overview
This PR (TIKA-4732) refines UNPACK/FRICTIONLESS output semantics so the “original/container” document is represented consistently as embedded id 0 (e.g., unpacked/0.<ext>) and ensures the Frictionless manifest’s name reflects the caller-supplied filename (rather than an internal/spooled name). It also adds/updates tests to lock in the expected Frictionless zip shape and manifest-to-archive consistency.
Changes:
- Update pipes UNPACK zip/frictionless emission to rely on `ParseHandler._preParse` adding the container as embedded id `0` (removing separate “original document” handling).
- Preserve `TikaCoreProperties.RESOURCE_NAME_KEY` across the “fresh metadata” boundary in `PipesWorker` so detection/suffix selection and Frictionless `name` use the logical filename.
- Add/extend server + integration tests validating Frictionless zip structure, manifest correctness, and includeOriginal/container behavior.
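
The second change can be pictured with a small stand-in. This is a minimal sketch, not the code in the PR: a plain `Map` takes the place of Tika's `Metadata`, the string `"resourceName"` is assumed to mirror the value of `TikaCoreProperties.RESOURCE_NAME_KEY`, and `freshMetadata` is a hypothetical helper name.

```java
import java.util.HashMap;
import java.util.Map;

public class ResourceNameCarryOver {
    // Assumption: stands in for TikaCoreProperties.RESOURCE_NAME_KEY.
    static final String RESOURCE_NAME_KEY = "resourceName";

    // Hypothetical helper: build empty metadata for the fetch/parse step,
    // carrying over only the caller-supplied logical filename.
    static Map<String, String> freshMetadata(Map<String, String> callerMetadata) {
        Map<String, String> fresh = new HashMap<>();
        String name = callerMetadata.get(RESOURCE_NAME_KEY);
        if (name != null) {
            fresh.put(RESOURCE_NAME_KEY, name);
        }
        return fresh;
    }

    public static void main(String[] args) {
        Map<String, String> caller = new HashMap<>();
        caller.put(RESOURCE_NAME_KEY, "report.docx");
        caller.put("Content-Type", "application/octet-stream");
        // Only the resource name survives the boundary.
        System.out.println(freshMetadata(caller));
    }
}
```

With the name carried over, later steps that pick a file suffix or fill in the manifest `name` see the logical filename instead of a spooled temp-file name.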
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/UnpackerResourceTest.java` | Adds Frictionless-focused server tests for manifest/resources accuracy, container naming, and `/unpack` vs `/unpack/all` behavior. |
| `tika-pipes/tika-pipes-integration-tests/src/test/java/org/apache/tika/pipes/core/FrictionlessUnpackTest.java` | Tightens integration assertions around `includeOriginal=true` behavior in Frictionless zips (container as `unpacked/0.*`, manifest paths constrained to `unpacked/`). |
| `tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/server/PipesWorker.java` | Removes separate original-document emission and copies `RESOURCE_NAME_KEY` into the new metadata used during fetch/parse. |
| `tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/extractor/TempFileUnpackHandler.java` | Removes stored-original-document fields/methods; handler now only tracks embedded entries written during `add()`. |
| `tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/extractor/FrictionlessUnpackHandler.java` | Removes stored-original-document support and relies on embedded id `0` carrying the container; manifest `name` now comes from `RESOURCE_NAME_KEY` in per-file metadata. |
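
The zip-shape expectations described in the table can be sketched as three checks over entry names. This is an illustrative sketch only, not the assertions in the PR's tests: the class and method names are hypothetical, the `unpacked/` prefix and `unpacked/0.<ext>` container path come from the descriptions above, and `datapackage.json` in the example is an assumed manifest filename taken from the Frictionless Data Package convention.

```java
import java.util.List;

public class FrictionlessShapeCheck {
    // Payload files are expected under this prefix (per the overview above).
    static final String PREFIX = "unpacked/";

    // The container should appear as embedded id 0, i.e. unpacked/0.<ext>.
    static boolean hasContainerEntry(List<String> entryNames) {
        return entryNames.stream().anyMatch(n -> n.startsWith(PREFIX + "0."));
    }

    // Every path listed in the manifest should stay inside unpacked/.
    static boolean manifestPathsConstrained(List<String> manifestPaths) {
        return manifestPaths.stream().allMatch(p -> p.startsWith(PREFIX));
    }

    // Every manifest path should correspond to an actual zip entry.
    static boolean manifestMatchesArchive(List<String> manifestPaths,
                                          List<String> entryNames) {
        return entryNames.containsAll(manifestPaths);
    }

    public static void main(String[] args) {
        // Hypothetical archive listing; datapackage.json is an assumed name.
        List<String> entries =
                List.of("datapackage.json", "unpacked/0.docx", "unpacked/1.png");
        List<String> manifest = List.of("unpacked/0.docx", "unpacked/1.png");
        System.out.println(hasContainerEntry(entries));
        System.out.println(manifestPathsConstrained(manifest));
        System.out.println(manifestMatchesArchive(manifest, entries));
    }
}
```

Matching on the prefix `unpacked/0.` rather than `unpacked/0` keeps entries such as `unpacked/10.png` from being mistaken for the container.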