TIKA-4732#2820

Merged
tballison merged 4 commits into main from TIKA-4732 on May 18, 2026

Conversation

@tballison
Contributor

Thanks for your contribution to Apache Tika! Your help is appreciated!

Before opening the pull request, please verify that

  • there is an open issue on the Tika issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
  • the issue ID (TIKA-XXXX)
    • is referenced in the title of the pull request
    • and placed in front of your commit messages surrounded by square brackets ([TIKA-XXXX] Issue or pull request title)
  • commits are squashed into a single one (or few commits for larger changes)
  • Tika is successfully built and unit tests pass by running ./mvnw clean test
  • there should be no conflicts when merging the pull request branch into the recent main branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled main branch
  • if you add a new module that downstream users will depend upon, add it to the relevant group in tika-bom/pom.xml.

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Tika in general, please sign up for the Tika mailing list. Thanks!

Contributor

Copilot AI left a comment

Pull request overview

This PR (TIKA-4732) refines UNPACK/FRICTIONLESS output semantics so the “original/container” document is represented consistently as embedded id 0 (e.g., unpacked/0.<ext>) and ensures the Frictionless manifest’s name reflects the caller-supplied filename (rather than an internal/spooled name). It also adds/updates tests to lock in the expected Frictionless zip shape and manifest-to-archive consistency.

Changes:

  • Update pipes UNPACK zip/frictionless emission to rely on ParseHandler._preParse adding the container as embedded id 0 (removing separate “original document” handling).
  • Preserve TikaCoreProperties.RESOURCE_NAME_KEY across the “fresh metadata” boundary in PipesWorker so detection/suffix selection and Frictionless name use the logical filename.
  • Add/extend server + integration tests validating Frictionless zip structure, manifest correctness, and includeOriginal/container behavior.
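
The second change above (carrying the logical filename across the "fresh metadata" boundary) can be sketched roughly as follows. This is a minimal illustration, not the code from `PipesWorker`: a plain `Map` stands in for Tika's `Metadata` class, the `RESOURCE_NAME_KEY` constant is a stand-in for `TikaCoreProperties.RESOURCE_NAME_KEY` (its string value here is an assumption), and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ResourceNameCarryOver {
    // Assumption: stand-in for TikaCoreProperties.RESOURCE_NAME_KEY.
    static final String RESOURCE_NAME_KEY = "resourceName";

    // Hypothetical helper: start from an empty metadata map and copy over
    // only the caller-supplied file name, so later steps (type detection,
    // suffix selection, the manifest name) see the logical name rather
    // than an internal or spooled one.
    static Map<String, String> freshMetadata(Map<String, String> requestMetadata) {
        Map<String, String> fresh = new HashMap<>();
        String name = requestMetadata.get(RESOURCE_NAME_KEY);
        if (name != null) {
            fresh.put(RESOURCE_NAME_KEY, name);
        }
        return fresh;
    }

    public static void main(String[] args) {
        Map<String, String> request = new HashMap<>();
        request.put(RESOURCE_NAME_KEY, "report.docx");
        request.put("internal-temp-path", "/tmp/spool-1234.tmp"); // illustrative key, dropped below
        System.out.println(freshMetadata(request));
    }
}
```

Copying a single key rather than the whole map keeps the new metadata otherwise clean, which appears to be the point of the boundary in the first place.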

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/UnpackerResourceTest.java | Adds Frictionless-focused server tests for manifest/resources accuracy, container naming, and /unpack vs /unpack/all behavior. |
| tika-pipes/tika-pipes-integration-tests/src/test/java/org/apache/tika/pipes/core/FrictionlessUnpackTest.java | Tightens integration assertions around includeOriginal=true behavior in Frictionless zips (container as unpacked/0.*, manifest paths constrained to unpacked/). |
| tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/server/PipesWorker.java | Removes separate original-document emission and copies RESOURCE_NAME_KEY into the new metadata used during fetch/parse. |
| tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/extractor/TempFileUnpackHandler.java | Removes stored-original-document fields/methods; handler now only tracks embedded entries written during add(). |
| tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/extractor/FrictionlessUnpackHandler.java | Removes stored-original-document support and relies on embedded id 0 carrying the container; manifest name now comes from RESOURCE_NAME_KEY in per-file metadata. |
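
The kind of manifest-to-archive consistency the tests assert could look something like the sketch below. It is not taken from `FrictionlessUnpackTest`: the class and helper names are hypothetical, the manifest paths are passed in as an already-parsed list rather than read from JSON, and the `datapackage.json` entry name is an assumption based on the Frictionless Data Package convention. Only `java.util.zip` from the standard library is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class FrictionlessZipShapeCheck {

    // Collect the entry names of a zip held in memory.
    static Set<String> entryNames(byte[] zipBytes) throws IOException {
        Set<String> names = new LinkedHashSet<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Every manifest path must live under unpacked/ and exist in the archive.
    // When the container is expected (includeOriginal=true), one path must
    // be unpacked/0.<ext>.
    static boolean manifestMatchesArchive(List<String> manifestPaths,
                                          Set<String> entries,
                                          boolean expectContainer) {
        boolean sawContainer = false;
        for (String path : manifestPaths) {
            if (!path.startsWith("unpacked/") || !entries.contains(path)) {
                return false;
            }
            if (path.startsWith("unpacked/0.")) {
                sawContainer = true;
            }
        }
        return !expectContainer || sawContainer;
    }

    // Build a small in-memory zip for the demo; contents are placeholders.
    static byte[] sampleZip(String... names) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(name.getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: manifest entry name follows the Data Package convention.
        byte[] zip = sampleZip("datapackage.json", "unpacked/0.docx", "unpacked/1.png");
        Set<String> entries = entryNames(zip);
        List<String> manifestPaths = List.of("unpacked/0.docx", "unpacked/1.png");
        System.out.println(manifestMatchesArchive(manifestPaths, entries, true));
    }
}
```

In a real test the manifest paths would come from parsing the manifest inside the zip; the list is hard-coded here only to keep the sketch dependency-free.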

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@tballison tballison merged commit a803c16 into main May 18, 2026
6 checks passed
@tballison tballison deleted the TIKA-4732 branch May 18, 2026 15:55

3 participants