feat(storage): Add compressed archive support by akashverma0786 · Pull Request #28146 · open-metadata/OpenMetadata

akashverma0786 · 2026-05-15T12:06:57Z

Describe your changes:

Fixes #

I worked on ... because ...

Type of change:

High-level design:

N/A — small change.

Tests:

Use cases covered

Unit tests

Backend integration tests

Ingestion integration tests

Playwright (UI) tests

Manual testing performed

UI screen recording / screenshots:

Not applicable.

Checklist:

I have read the CONTRIBUTING document.
My PR title is Fixes <issue-number>: <short explanation>
My PR is linked to a GitHub issue via Fixes #<issue-number> above.
I have commented on my code, particularly in hard-to-understand areas.
For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
For UI changes: I attached a screen recording and/or screenshots above.
I have added tests (unit / integration / Playwright as applicable) and listed them above.

Summary by Gitar

Dependency updates:
- Upgraded sqlalchemy-pytds from ~=0.3 to ~=1.0 in ingestion/setup.py to fix AttributeError during server-side cursor fetches.
Storage ingestion features:
- Added support for processing compressed archives (zip, tar, tar.gz, tgz, 7z, rar) in S3 and GCS connectors.
- Implemented open_archive_reader and ArchiveReader infrastructure to handle nested file ingestion with schema inference.
Testing and validation:
- Added comprehensive unit tests in test_archive.py covering archive readers, security checks, and column inference.
- Added GCS and S3 integration tests to verify container generation logic and archive processing flows.
Robustness and safety:
- Added path validation via _is_safe_archive_path to prevent directory traversal vulnerabilities during archive extraction.
- Added try-except error handling in _generate_structured_containers to prevent ingestion failures when processing individual corrupted archives.

_{This will update automatically on new commits.}

github-actions · 2026-05-15T12:07:18Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

github-actions · 2026-05-15T12:50:34Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

github-actions · 2026-05-15T12:51:11Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

github-actions · 2026-05-15T13:02:27Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

github-actions · 2026-05-15T13:14:36Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

github-actions · 2026-05-18T09:05:35Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

github-actions · 2026-05-19T15:30:14Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

sonarqubecloud · 2026-05-19T16:56:08Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
23.2% Coverage on New Code
1.1% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-05-19T18:04:33Z

🟡 Playwright Results — all passed (10 flaky)

✅ 4139 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 89 skipped

Shard	Passed	Flaky	Skipped
🟡 Shard 1	298	1	4
🟡 Shard 2	777	1	11
🟡 Shard 3	774	4	7
🟡 Shard 4	781	1	12
🟡 Shard 5	771	2	47
🟡 Shard 6	738	1	8

🟡 10 flaky test(s) (passed on retry)

Features/DataAssetRulesEnabled.spec.ts › Verify the SearchIndex Service Entity Action items after rules is Enabled (shard 1, 1 retry)
Features/KnowledgeCenter.spec.ts › Article mentions in description should working for Knowledge Center (shard 2, 1 retry)
Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
Pages/DomainAdvanced.spec.ts › Filter assets by domain from explore page (shard 4, 1 retry)
Pages/Entity.spec.ts › Delete Search Index (shard 5, 1 retry)
Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

gitar-bot · 2026-05-19T18:39:12Z

Code Review ✅ Approved 5 resolved / 5 findings

Adds compressed archive support for S3 and GCS connectors, including robust path validation and archive readers. All initial findings regarding zip bomb risks, path traversal, and null pointer exceptions have been resolved.

✅ 5 resolved

✅ Edge Case: is_archive_format crashes on None structureFormat

The function is_archive_format at line 527 calls structure_format.lower() after a truthy check via bool(structure_format). Due to short-circuit evaluation this is safe. However, at the call sites in S3/GCS metadata (e.g., metadata_entry.structureFormat), if structureFormat is None, the function works correctly. No issue here — false alarm on closer inspection. Disregard.

✅ Security: Missing path traversal validation in ZIP and RAR readers

📄 ingestion/src/metadata/readers/archive.py:244-258 📄 ingestion/src/metadata/readers/archive.py:319-333 📄 ingestion/src/metadata/readers/archive.py:471-484
The TarArchiveReader and SevenZipArchiveReader validate entry filenames for path traversal attacks (absolute paths starting with / and .. segments), but ZipArchiveReader, ZipRangeReader, and RarArchiveReader lack equivalent checks. While the data is only read into memory (not extracted to disk), the entry names flow into container prefixes/paths downstream (e.g., inner_prefix = f"{archive_prefix}/{entry.name.lstrip(KEY_SEPARATOR)}") and could produce malformed entity paths or confusing metadata.

Add the same validation used in TarArchiveReader to a shared helper and apply it in all readers, or add it centrally in iter_archive_entries_with_schema().

✅ Security: Zip bomb risk: no ratio check on decompressed data

📄 ingestion/src/metadata/readers/archive.py:332-336
The ZipRangeReader.entries() method decompresses data at line 336 with zlib.decompress(compressed, -15) without verifying the decompression ratio. A malicious ZIP could have a tiny compressed payload that expands to hundreds of MB (within the 256MB limit but with extreme ratios like 1000:1). While _MAX_INNER_FILE_BYTES caps individual file sizes, the uncomp_size field in the Central Directory is attacker-controlled and could be spoofed to be small while actual decompression yields much more data.

Consider using incremental decompression with a running size check, or validating the actual decompressed length matches zip_entry.uncomp_size.

✅ Quality: Overly broad except Exception may mask programming errors

📄 ingestion/src/metadata/ingestion/source/storage/gcs/metadata.py:321 📄 ingestion/src/metadata/ingestion/source/storage/gcs/metadata.py:416-426 📄 ingestion/src/metadata/ingestion/source/storage/s3/metadata.py:354 📄 ingestion/src/metadata/ingestion/source/storage/s3/metadata.py:458-468
Changing except ValueError to except Exception (lines 321/354) and adding a second blanket except Exception around the caller (lines 422/464) means any programming error (e.g., TypeError, AttributeError, KeyError) inside archive processing will be silently swallowed with only a warning log. This makes debugging harder during development and could hide regressions.

Consider catching a tuple of expected runtime exceptions (e.g., (ValueError, OSError, IOError, zipfile.BadZipFile, rarfile.Error)) rather than bare Exception. If the intent is truly to never crash, at minimum log at error level for unexpected exception types so they are visible in monitoring.

✅ Bug: get_by_name() called before container is persisted to server

📄 ingestion/src/metadata/ingestion/source/storage/s3/metadata.py:331-339 📄 ingestion/src/metadata/ingestion/source/storage/gcs/metadata.py:296-304
After yielding the parent archive container (S3ContainerDetails/GCSContainerDetails), the code immediately calls self.metadata.get_by_name(entity=Container, fqn=archive_fqn) to look up the just-yielded container. However, yielding a container detail from the generator doesn't persist it — the topology runner processes the yielded value through yield_create_container_requests and then sends it to the OpenMetadata server in a separate pipeline step. This means get_by_name() will return None on first ingestion (the container doesn't exist yet), causing all child containers to be skipped.

This is a fundamental design issue. Consider either:

Using the topology's built-in parent-child relationship mechanism (context-based references)

Splitting archive processing into two passes — first pass creates parents, second pass creates children

Using a deferred/lazy EntityReference that doesn't require the parent to already exist in the server

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

feat(storage): add compressed archive support

7b9ccfe

gitar-bot Bot reviewed May 15, 2026

View reviewed changes

Comment thread ingestion/src/metadata/ingestion/source/storage/s3/metadata.py

gitar-bot Bot reviewed May 15, 2026

View reviewed changes

Comment thread ingestion/src/metadata/readers/archive.py

gitar-bot Bot reviewed May 15, 2026

View reviewed changes

Comment thread ingestion/src/metadata/readers/archive.py

Fix: Try-Except logging added and better error handling

f551576

Fix: Added req. in setup.py

5bd56c4

gitar-bot Bot reviewed May 15, 2026

View reviewed changes

Comment thread ingestion/src/metadata/ingestion/source/storage/gcs/metadata.py Outdated

Fix: Gitar bot comments

bf74c56

Fix: Gitar bot comment

9fc9ecf

Merge branch 'main' into issue-3798

2b77990

Fix: pychekstyle and added tests

91bc49e

akashverma0786 self-assigned this May 19, 2026

akashverma0786 added Ingestion safe to test Add this label to run secure Github workflows on PRs labels May 19, 2026

Merge branch 'main' into issue-3798

96b4a90

akashverma0786 marked this pull request as ready for review May 19, 2026 15:41

akashverma0786 requested a review from a team as a code owner May 19, 2026 15:41

akashverma0786 temporarily deployed to test May 19, 2026 15:52 — with GitHub Actions Inactive

Fix: Unit test failure

4693b98

akashverma0786 requested a deployment to test May 19, 2026 18:48 — with GitHub Actions In progress

akashverma0786 deployed to test May 19, 2026 18:48 — with GitHub Actions Active

akashverma0786 requested a deployment to test May 19, 2026 18:48 — with GitHub Actions In progress

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(storage): Add compressed archive support#28146

feat(storage): Add compressed archive support#28146
akashverma0786 wants to merge 9 commits into
mainfrom
issue-3798

akashverma0786 commented May 15, 2026 •

edited by gitar-bot Bot

Loading

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

sonarqubecloud Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

gitar-bot Bot commented May 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

akashverma0786 commented May 15, 2026 • edited by gitar-bot Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes:

Type of change:

High-level design:

Tests:

Use cases covered

Unit tests

Backend integration tests

Ingestion integration tests

Playwright (UI) tests

Manual testing performed

UI screen recording / screenshots:

Checklist:

Summary by Gitar

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

sonarqubecloud Bot commented May 19, 2026

Quality Gate passed for 'open-metadata-ingestion'

Uh oh!

github-actions Bot commented May 19, 2026

🟡 Playwright Results — all passed (10 flaky)

Uh oh!

gitar-bot Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

akashverma0786 commented May 15, 2026 •

edited by gitar-bot Bot

Loading

gitar-bot Bot commented May 19, 2026 •

edited

Loading