feat(file-cdk): Support multi-sheet Excel parsing#1031

Draft
Ryan Waskewich (rwask) wants to merge 3 commits into main from devin/1779376325-excel-multisheet

@rwask Ryan Waskewich (rwask) commented May 21, 2026

Summary

Adds opt-in multi-sheet Excel parsing support to the file-based CDK.

  • Adds sheets_to_sync and sheet_names options to ExcelFormat, defaulting to first-sheet-only for backward compatibility.
  • Parses all selected sheets into the existing stream when enabled and adds _ab_sheet_name to each emitted record.
  • Infers a superset schema across selected sheets and keeps first-sheet-only behavior unchanged by default.
  • Fails fast when selected sheets are missing or a source column collides with reserved _ab_sheet_name metadata.
  • Keeps calamine-to-openpyxl fallback atomic so partially parsed calamine sheets are not emitted before fallback records.
  • Adds parser coverage for default behavior, all sheets, explicit sheets, schema merging, missing sheets, metadata collisions, and partial calamine fallback.
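
The selection, tagging, and fail-fast rules above can be sketched in plain Python. This is a minimal illustration, not the CDK implementation: the workbook is modeled as a dict of sheet name to rows, and `select_sheets`, `read_sheets`, and the `all_sheets` flag are hypothetical names invented here. Only `_ab_sheet_name`, `sheet_names`, and the two failure modes come from the PR description.

```python
from typing import Any, Dict, Iterator, List, Optional

SHEET_NAME_FIELD = "_ab_sheet_name"  # reserved metadata field named in the PR

Row = Dict[str, Any]
Workbook = Dict[str, List[Row]]  # assumption: sheet name -> rows, in workbook order


def select_sheets(
    available: List[str],
    sheet_names: Optional[List[str]] = None,
    all_sheets: bool = False,  # hypothetical flag standing in for sheets_to_sync
) -> List[str]:
    """Return the sheets to read, failing fast on unknown names."""
    if sheet_names:
        missing = [name for name in sheet_names if name not in available]
        if missing:
            # Fail loudly instead of silently reading another sheet.
            raise ValueError(f"Sheet names {missing} were not found")
        return list(sheet_names)
    if all_sheets:
        return list(available)
    # Default: first sheet only, matching the pre-existing behavior.
    return available[:1]


def read_sheets(workbook: Workbook, selected: List[str]) -> Iterator[Row]:
    """Yield every row of the selected sheets, tagged with its sheet name."""
    for sheet in selected:
        for row in workbook[sheet]:
            if SHEET_NAME_FIELD in row:
                raise ValueError(
                    f"Column {SHEET_NAME_FIELD!r} in sheet {sheet!r} "
                    "collides with reserved metadata"
                )
            yield {**row, SHEET_NAME_FIELD: sheet}


# Illustrative data only; these rows are not taken from the PR's test fixtures.
workbook: Workbook = {
    "People": [{"id": 1, "name": "alice"}],
    "Orders": [{"id": 1, "amount": 10.5}],
}
records = list(read_sheets(workbook, select_sheets(list(workbook), all_sheets=True)))
```

In this sketch the sheet tag is added whenever `read_sheets` runs. The PR describes tagging only when multi-sheet parsing is enabled, so a real implementation would presumably skip it on the default first-sheet-only path.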

Review & Testing Checklist for Human

  • Confirm _ab_sheet_name is the right reserved metadata field name and that failing on source-column collision is the desired behavior.
  • Confirm the ExcelFormat option names and UI descriptions are acceptable for file-based connectors.
  • Run an end-to-end file-based source sync against a multi-sheet workbook after a CDK release is consumed by a connector such as SharePoint Enterprise.

Notes

Validated locally with:

poetry run pytest unit_tests/sources/file_based/file_types/test_excel_parser.py -q
poetry run ruff check airbyte_cdk/sources/file_based/file_types/excel_parser.py unit_tests/sources/file_based/file_types/test_excel_parser.py
poetry run ruff format --check airbyte_cdk/sources/file_based/file_types/excel_parser.py unit_tests/sources/file_based/file_types/test_excel_parser.py
poetry run mypy --config-file mypy.ini airbyte_cdk

Earlier validation also included:

poetry run poe lint
poetry run poe type-check
poetry run pytest unit_tests/sources/file_based/file_types/test_excel_parser.py unit_tests/sources/file_based/test_file_based_scenarios.py -q -k 'excel or Excel'

SharePoint Enterprise runtime validation was performed against a controlled two-sheet workbook with this CDK branch pinned locally: check succeeded; discover included both first-sheet and second-sheet fields plus _ab_sheet_name; read emitted exactly 5 records split across the People and Orders sheets; and a config naming an explicit missing sheet failed loudly instead of silently reading another sheet.

Link to Devin session: https://app.devin.ai/sessions/a75bf41ff80c4846b67a200545d7cd19
Requested by: Ryan Waskewich (@rwask)

Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1779376325-excel-multisheet#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1779376325-excel-multisheet

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerates git-committed build artifacts, such as the Pydantic models generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

Comment on lines +48 to +51
def validate_sheet_selection(
    cls,
    values: Dict[str, Any],  # noqa: N805  # Pydantic validators use cls, not self
) -> Dict[str, Any]:

This is a Pydantic v1 @root_validator; cls is the expected first parameter for this validator style. Ruff/MyPy are also clean with the existing # noqa: N805, so I’m treating this as a non-functional false positive rather than changing behavior.

Adding the session link for traceability:


Devin session

@github-actions

github-actions Bot commented May 21, 2026

PyTest Results (Fast)

4 078 tests (+9): 4 068 ✅ passed (+10), 10 💤 skipped (-1), 0 ❌ failed (±0)
1 suite (±0), 1 file (±0), run time 7m 56s ⏱️ (+12s)

Results for commit 148bf19. ± Comparison against base commit f67a9d9.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented May 21, 2026

PyTest Results (Full)

4 081 tests (+9): 4 070 ✅ passed (+10), 11 💤 skipped (-1), 0 ❌ failed (±0)
1 suite (±0), 1 file (±0), run time 11m 3s ⏱️ (+6s)

Results for commit 148bf19. ± Comparison against base commit f67a9d9.

♻️ This comment has been updated with latest results.

Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
@devin-ai-integration

Runtime validation requested by Ryan Waskewich: SharePoint Enterprise successfully consumed this CDK branch and read a controlled multi-sheet Excel workbook.

SharePoint Enterprise runtime test results
  • Passed: check returned CONNECTION_STATUS with status=SUCCEEDED after omitting unset optional config fields.
  • Passed: discover produced stream devin_excel_multisheet_1779383097 with schema fields id, name, amount, and _ab_sheet_name; amount only exists on the second worksheet, so this distinguishes the new multi-sheet path from legacy first-sheet-only behavior.
  • Passed: read emitted exactly 5 records: 2 with _ab_sheet_name=People and 3 with _ab_sheet_name=Orders.
  • Passed: record values matched the fixture exactly: names alice, bob; amounts 10.5, 20.0, 30.25.
  • Passed: explicit missing sheet config sheet_names=["MissingSheet"] failed instead of silently reading another sheet; output included Sheet names ['MissingSheet'] were not found and the workbook path.
  • Cleanup complete: the temporary workbook was deleted from the SharePoint test site; local generated secret/config/output artifacts were removed.
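
The discover result above, where amount appears in the stream schema even though it exists only on the second worksheet, is what a superset merge across sheets produces. The following is a hedged sketch of that idea, not the CDK's schema inference: `infer_schema`, `merge_schemas`, the coarse type names, and the choice to widen conflicting types to string are all assumptions made for illustration. The names and amounts mirror the fixture values reported above; the id values are invented.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Schema = Dict[str, str]  # column name -> coarse type name (assumption)


def column_type(value: Any) -> str:
    """Map a Python value to a coarse type name."""
    if isinstance(value, bool):  # bool must be checked before int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def infer_schema(rows: List[Row]) -> Schema:
    """Infer one sheet's schema from the first value seen in each column."""
    schema: Schema = {}
    for row in rows:
        for column, value in row.items():
            schema.setdefault(column, column_type(value))
    return schema


def merge_schemas(schemas: List[Schema]) -> Schema:
    """Union the columns of all sheets, keeping first-seen column order."""
    merged: Schema = {}
    for schema in schemas:
        for column, type_name in schema.items():
            if column in merged and merged[column] != type_name:
                merged[column] = "string"  # assumption: widen on conflict
            else:
                merged.setdefault(column, type_name)
    return merged


# Names and amounts follow the reported fixture; the ids are hypothetical.
people = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
orders = [
    {"id": 1, "amount": 10.5},
    {"id": 2, "amount": 20.0},
    {"id": 3, "amount": 30.25},
]

stream_schema = merge_schemas([infer_schema(people), infer_schema(orders)])
stream_schema["_ab_sheet_name"] = "string"  # metadata field added per record
```

Records from a sheet that lacks a column simply omit it, so downstream consumers would see name missing on Orders rows and amount missing on People rows.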
Commands exercised
poetry run source-sharepoint-enterprise check --config secrets/<generated-config>.json
poetry run source-sharepoint-enterprise discover --config secrets/<generated-config>.json
poetry run source-sharepoint-enterprise read --config secrets/<generated-config>.json --catalog test_artifacts/configured_catalog_multisheet.json
poetry run source-sharepoint-enterprise read --config secrets/<missing-sheet-config>.json --catalog test_artifacts/configured_catalog_multisheet.json

The connector environment was pinned to the local CDK branch during validation; no generated SharePoint config, catalog, output, or secret files were committed.

CI caveat

All required CDK checks are passing. The optional Check: source-google-drive job is failing in that connector's standard-test instantiation path with SourceGoogleDrive.__init__() missing catalog, config, and state; I did not change Google Drive or the standard-test harness in this PR. The optional Check: source-shopify job was still pending when I proceeded with the requested SharePoint runtime validation.

Devin session: https://app.devin.ai/sessions/a75bf41ff80c4846b67a200545d7cd19

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Comment on lines +188 to +192
) -> Iterable[pd.DataFrame]:
    try:
        yield from self._parse_sheets_with_calamine(fp, logger, file, excel_format)
    except ExcelCalamineParsingError:
        yield from self._parse_sheets_with_openpyxl(fp, logger, file, excel_format)

🔴 Generator-based calamine fallback can emit duplicate records for already-yielded sheets

The _parse_sheets method uses yield from inside a try/except to implement calamine-to-openpyxl fallback. Because _parse_sheets_with_calamine is a generator, DataFrames are yielded lazily one sheet at a time. If calamine successfully parses sheet 1 (yielding its DataFrame to the consumer in parse_records at line 95), but then fails with a PanicException on sheet 2, the ExcelCalamineParsingError is caught and the fallback yield from self._parse_sheets_with_openpyxl(...) re-parses the entire file — including sheet 1. This causes all records from sheet 1 to be emitted twice (once from calamine, once from openpyxl).

The original open_and_parse_file (excel_parser.py:344-363) is atomic (returns a single DataFrame), so the same pattern works correctly there. The fix is to materialize all calamine results before yielding, so the fallback is all-or-nothing.

Suggested change
- ) -> Iterable[pd.DataFrame]:
-     try:
-         yield from self._parse_sheets_with_calamine(fp, logger, file, excel_format)
-     except ExcelCalamineParsingError:
-         yield from self._parse_sheets_with_openpyxl(fp, logger, file, excel_format)
+ ) -> Iterable[pd.DataFrame]:
+     try:
+         dataframes = list(self._parse_sheets_with_calamine(fp, logger, file, excel_format))
+     except ExcelCalamineParsingError:
+         dataframes = list(self._parse_sheets_with_openpyxl(fp, logger, file, excel_format))
+     yield from dataframes
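
The duplication described in this comment can be reproduced without pandas or any Excel engine. The sketch below uses stand-in generators: `CalamineError`, `parse_with_calamine`, and `parse_with_openpyxl` are hypothetical substitutes for `ExcelCalamineParsingError` and the two parser methods, and each "sheet" is just a string. It contrasts the lazy `yield from` fallback with the materialized, all-or-nothing version.

```python
from typing import Iterator


class CalamineError(Exception):
    """Stand-in for ExcelCalamineParsingError."""


def parse_with_calamine() -> Iterator[str]:
    # Sheet 1 parses fine, then the engine fails on sheet 2.
    yield "sheet1 (calamine)"
    raise CalamineError("panic on sheet 2")


def parse_with_openpyxl() -> Iterator[str]:
    # The fallback engine re-parses the whole workbook from the start.
    yield "sheet1 (openpyxl)"
    yield "sheet2 (openpyxl)"


def lazy_fallback() -> Iterator[str]:
    # Buggy: sheet 1 has already reached the consumer when the error is raised.
    try:
        yield from parse_with_calamine()
    except CalamineError:
        yield from parse_with_openpyxl()


def atomic_fallback() -> Iterator[str]:
    # Fixed: nothing is yielded until one engine has parsed every sheet.
    try:
        results = list(parse_with_calamine())
    except CalamineError:
        results = list(parse_with_openpyxl())
    yield from results


lazy = list(lazy_fallback())      # sheet 1 appears twice
atomic = list(atomic_fallback())  # each sheet appears once
```

The trade-off is memory: materializing holds every parsed sheet at once instead of streaming them one at a time, which is the price of making the fallback atomic.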

Confirmed this was a real generator/fallback bug. I updated _parse_sheets so calamine results are materialized before yielding; fallback to openpyxl is now all-or-nothing, so partially yielded calamine sheets cannot be duplicated. I also added test_multi_sheet_calamine_fallback_does_not_duplicate_partial_results to cover this case.

Validation run:

poetry run pytest unit_tests/sources/file_based/file_types/test_excel_parser.py -q
poetry run ruff check airbyte_cdk/sources/file_based/file_types/excel_parser.py unit_tests/sources/file_based/file_types/test_excel_parser.py
poetry run ruff format --check airbyte_cdk/sources/file_based/file_types/excel_parser.py unit_tests/sources/file_based/file_types/test_excel_parser.py
poetry run mypy --config-file mypy.ini airbyte_cdk

Devin session

Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>