Skip to content

feat(storage): Add compressed archive support#28146

Open
akashverma0786 wants to merge 9 commits into
mainfrom
issue-3798
Open

feat(storage): Add compressed archive support#28146
akashverma0786 wants to merge 9 commits into
mainfrom
issue-3798

Conversation

@akashverma0786
Copy link
Copy Markdown
Collaborator

@akashverma0786 akashverma0786 commented May 15, 2026

Describe your changes:

Fixes #

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

High-level design:

N/A — small change.

Tests:

Use cases covered

Unit tests

Backend integration tests

Ingestion integration tests

Playwright (UI) tests

Manual testing performed

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • My PR is linked to a GitHub issue via Fixes #<issue-number> above.
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • For UI changes: I attached a screen recording and/or screenshots above.
  • I have added tests (unit / integration / Playwright as applicable) and listed them above.

Summary by Gitar

  • Dependency updates:
    • Upgraded sqlalchemy-pytds from ~=0.3 to ~=1.0 in ingestion/setup.py to fix AttributeError during server-side cursor fetches.
  • Storage ingestion features:
    • Added support for processing compressed archives (zip, tar, tar.gz, tgz, 7z, rar) in S3 and GCS connectors.
    • Implemented open_archive_reader and ArchiveReader infrastructure to handle nested file ingestion with schema inference.
  • Testing and validation:
    • Added comprehensive unit tests in test_archive.py covering archive readers, security checks, and column inference.
    • Added GCS and S3 integration tests to verify container generation logic and archive processing flows.
  • Robustness and safety:
    • Added path validation via _is_safe_archive_path to prevent directory traversal vulnerabilities during archive extraction.
    • Added try-except error handling in _generate_structured_containers to prevent ingestion failures when processing individual corrupted archives.

This will update automatically on new commits.

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/storage/s3/metadata.py
Comment thread ingestion/src/metadata/readers/archive.py
Comment thread ingestion/src/metadata/readers/archive.py
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/storage/gcs/metadata.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@akashverma0786 akashverma0786 self-assigned this May 19, 2026
@akashverma0786 akashverma0786 added Ingestion safe to test Add this label to run secure Github workflows on PRs labels May 19, 2026
@sonarqubecloud
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (10 flaky)

✅ 4139 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 89 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 298 0 1 4
🟡 Shard 2 777 0 1 11
🟡 Shard 3 774 0 4 7
🟡 Shard 4 781 0 1 12
🟡 Shard 5 771 0 2 47
🟡 Shard 6 738 0 1 8
🟡 10 flaky test(s) (passed on retry)
  • Features/DataAssetRulesEnabled.spec.ts › Verify the SearchIndex Service Entity Action items after rules is Enabled (shard 1, 1 retry)
  • Features/KnowledgeCenter.spec.ts › Article mentions in description should working for Knowledge Center (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DomainAdvanced.spec.ts › Filter assets by domain from explore page (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Delete Search Index (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 19, 2026

Code Review ✅ Approved 5 resolved / 5 findings

Adds compressed archive support for S3 and GCS connectors, including robust path validation and archive readers. All initial findings regarding zip bomb risks, path traversal, and null pointer exceptions have been resolved.

✅ 5 resolved
Edge Case: is_archive_format crashes on None structureFormat

The function is_archive_format at line 527 calls structure_format.lower() after a truthy check via bool(structure_format). Due to short-circuit evaluation this is safe. However, at the call sites in S3/GCS metadata (e.g., metadata_entry.structureFormat), if structureFormat is None, the function works correctly. No issue here — false alarm on closer inspection. Disregard.

Security: Missing path traversal validation in ZIP and RAR readers

📄 ingestion/src/metadata/readers/archive.py:244-258 📄 ingestion/src/metadata/readers/archive.py:319-333 📄 ingestion/src/metadata/readers/archive.py:471-484
The TarArchiveReader and SevenZipArchiveReader validate entry filenames for path traversal attacks (absolute paths starting with / and .. segments), but ZipArchiveReader, ZipRangeReader, and RarArchiveReader lack equivalent checks. While the data is only read into memory (not extracted to disk), the entry names flow into container prefixes/paths downstream (e.g., inner_prefix = f"{archive_prefix}/{entry.name.lstrip(KEY_SEPARATOR)}") and could produce malformed entity paths or confusing metadata.

Add the same validation used in TarArchiveReader to a shared helper and apply it in all readers, or add it centrally in iter_archive_entries_with_schema().

Security: Zip bomb risk: no ratio check on decompressed data

📄 ingestion/src/metadata/readers/archive.py:332-336
The ZipRangeReader.entries() method decompresses data at line 336 with zlib.decompress(compressed, -15) without verifying the decompression ratio. A malicious ZIP could have a tiny compressed payload that expands to hundreds of MB (within the 256MB limit but with extreme ratios like 1000:1). While _MAX_INNER_FILE_BYTES caps individual file sizes, the uncomp_size field in the Central Directory is attacker-controlled and could be spoofed to be small while actual decompression yields much more data.

Consider using incremental decompression with a running size check, or validating the actual decompressed length matches zip_entry.uncomp_size.

Quality: Overly broad except Exception may mask programming errors

📄 ingestion/src/metadata/ingestion/source/storage/gcs/metadata.py:321 📄 ingestion/src/metadata/ingestion/source/storage/gcs/metadata.py:416-426 📄 ingestion/src/metadata/ingestion/source/storage/s3/metadata.py:354 📄 ingestion/src/metadata/ingestion/source/storage/s3/metadata.py:458-468
Changing except ValueError to except Exception (lines 321/354) and adding a second blanket except Exception around the caller (lines 422/464) means any programming error (e.g., TypeError, AttributeError, KeyError) inside archive processing will be silently swallowed with only a warning log. This makes debugging harder during development and could hide regressions.

Consider catching a tuple of expected runtime exceptions (e.g., (ValueError, OSError, IOError, zipfile.BadZipFile, rarfile.Error)) rather than bare Exception. If the intent is truly to never crash, at minimum log at error level for unexpected exception types so they are visible in monitoring.

Bug: get_by_name() called before container is persisted to server

📄 ingestion/src/metadata/ingestion/source/storage/s3/metadata.py:331-339 📄 ingestion/src/metadata/ingestion/source/storage/gcs/metadata.py:296-304
After yielding the parent archive container (S3ContainerDetails/GCSContainerDetails), the code immediately calls self.metadata.get_by_name(entity=Container, fqn=archive_fqn) to look up the just-yielded container. However, yielding a container detail from the generator doesn't persist it — the topology runner processes the yielded value through yield_create_container_requests and then sends it to the OpenMetadata server in a separate pipeline step. This means get_by_name() will return None on first ingestion (the container doesn't exist yet), causing all child containers to be skipped.

This is a fundamental design issue. Consider either:

  1. Using the topology's built-in parent-child relationship mechanism (context-based references)
  2. Splitting archive processing into two passes — first pass creates parents, second pass creates children
  3. Using a deferred/lazy EntityReference that doesn't require the parent to already exist in the server
Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant