feat(file-cdk): Support multi-sheet Excel parsing#1031
feat(file-cdk): Support multi-sheet Excel parsing#1031Ryan Waskewich (rwask) wants to merge 3 commits into
Conversation
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1779376325-excel-multisheet#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1779376325-excel-multisheetPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
| def validate_sheet_selection( | ||
| cls, | ||
| values: Dict[str, Any], # noqa: N805 # Pydantic validators use cls, not self | ||
| ) -> Dict[str, Any]: |
There was a problem hiding this comment.
This is a Pydantic v1 @root_validator; cls is the expected first parameter for this validator style. Ruff/MyPy are also clean with the existing # noqa: N805, so I’m treating this as a non-functional false positive rather than changing behavior.
There was a problem hiding this comment.
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
|
Runtime validation requested by Ryan Waskewich: SharePoint Enterprise successfully consumed this CDK branch and read a controlled multi-sheet Excel workbook. SharePoint Enterprise runtime test results
Commands exercisedpoetry run source-sharepoint-enterprise check --config secrets/<generated-config>.json
poetry run source-sharepoint-enterprise discover --config secrets/<generated-config>.json
poetry run source-sharepoint-enterprise read --config secrets/<generated-config>.json --catalog test_artifacts/configured_catalog_multisheet.json
poetry run source-sharepoint-enterprise read --config secrets/<missing-sheet-config>.json --catalog test_artifacts/configured_catalog_multisheet.jsonThe connector environment was pinned to the local CDK branch during validation; no generated SharePoint config, catalog, output, or secret files were committed. CI caveatAll required CDK checks are passing. The optional Devin session: https://app.devin.ai/sessions/a75bf41ff80c4846b67a200545d7cd19 |
| ) -> Iterable[pd.DataFrame]: | ||
| try: | ||
| yield from self._parse_sheets_with_calamine(fp, logger, file, excel_format) | ||
| except ExcelCalamineParsingError: | ||
| yield from self._parse_sheets_with_openpyxl(fp, logger, file, excel_format) |
There was a problem hiding this comment.
🔴 Generator-based calamine fallback can emit duplicate records for already-yielded sheets
The _parse_sheets method uses yield from inside a try/except to implement calamine-to-openpyxl fallback. Because _parse_sheets_with_calamine is a generator, DataFrames are yielded lazily one sheet at a time. If calamine successfully parses sheet 1 (yielding its DataFrame to the consumer in parse_records at line 95), but then fails with a PanicException on sheet 2, the ExcelCalamineParsingError is caught and the fallback yield from self._parse_sheets_with_openpyxl(...) re-parses the entire file — including sheet 1. This causes all records from sheet 1 to be emitted twice (once from calamine, once from openpyxl).
The original open_and_parse_file (excel_parser.py:344-363) is atomic (returns a single DataFrame), so the same pattern works correctly there. The fix is to materialize all calamine results before yielding, so the fallback is all-or-nothing.
| ) -> Iterable[pd.DataFrame]: | |
| try: | |
| yield from self._parse_sheets_with_calamine(fp, logger, file, excel_format) | |
| except ExcelCalamineParsingError: | |
| yield from self._parse_sheets_with_openpyxl(fp, logger, file, excel_format) | |
| ) -> Iterable[pd.DataFrame]: | |
| try: | |
| dataframes = list(self._parse_sheets_with_calamine(fp, logger, file, excel_format)) | |
| except ExcelCalamineParsingError: | |
| dataframes = list(self._parse_sheets_with_openpyxl(fp, logger, file, excel_format)) | |
| yield from dataframes |
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Confirmed this was a real generator/fallback bug. I updated _parse_sheets so calamine results are materialized before yielding; fallback to openpyxl is now all-or-nothing, so partially yielded calamine sheets cannot be duplicated. I also added test_multi_sheet_calamine_fallback_does_not_duplicate_partial_results to cover this case.
Validation run:
poetry run pytest unit_tests/sources/file_based/file_types/test_excel_parser.py -q
poetry run ruff check airbyte_cdk/sources/file_based/file_types/excel_parser.py unit_tests/sources/file_based/file_types/test_excel_parser.py
poetry run ruff format --check airbyte_cdk/sources/file_based/file_types/excel_parser.py unit_tests/sources/file_based/file_types/test_excel_parser.py
poetry run mypy --config-file mypy.ini airbyte_cdkCo-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
Summary
Adds opt-in multi-sheet Excel parsing support to the file-based CDK.
sheets_to_syncandsheet_namesoptions toExcelFormat, defaulting to first-sheet-only for backward compatibility._ab_sheet_nameto each emitted record._ab_sheet_namemetadata.Review & Testing Checklist for Human
_ab_sheet_nameis the right reserved metadata field name and that failing on source-column collision is the desired behavior.ExcelFormatoption names and UI descriptions are acceptable for file-based connectors.Notes
Validated locally with:
Earlier validation also included:
poetry run poe lint poetry run poe type-check poetry run pytest unit_tests/sources/file_based/file_types/test_excel_parser.py unit_tests/sources/file_based/test_file_based_scenarios.py -q -k 'excel or Excel'SharePoint Enterprise runtime validation was performed against a controlled two-sheet workbook with this CDK branch pinned locally.
checksucceeded,discoverincluded both first-sheet and second-sheet fields plus_ab_sheet_name,reademitted exactly 5 records split acrossPeopleandOrders, and an explicit missing sheet failed loudly instead of silently reading another sheet.Link to Devin session: https://app.devin.ai/sessions/a75bf41ff80c4846b67a200545d7cd19
Requested by: Ryan Waskewich (@rwask)