[refactor](table) Refactor table and file reader #63893
Draft
Gabriel39 wants to merge 7 commits into
Gabriel39 added a commit to Gabriel39/incubator-doris that referenced this pull request on May 29, 2026
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63893
Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Added BE UT cases in table_reader_test and parquet_reader_test.
- Ran git diff --check.
- Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17.
- Behavior changed: No
- Does this need documentation: No
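The temporary-column cleanup described in the Problem Summary can be illustrated with a minimal sketch. This is not the Doris BE implementation (which is C++); the block is modeled as a plain dict of column lists, and `apply_equality_deletes` and the `__tmp_cast_` prefix are hypothetical names chosen for illustration. The point it shows: a cast column is materialized only to evaluate the delete predicate, and the returned block keeps exactly the original columns.

```python
from typing import Dict, List, Set

# A toy stand-in for a file block: column name -> column values.
Block = Dict[str, List[object]]


def apply_equality_deletes(block: Block, key_column: str, deleted_keys: Set[int]) -> Block:
    """Drop rows whose key (after a cast to int) appears in deleted_keys.

    Hypothetical helper: the cast result is stored in a temporary column
    that is removed again before the block is returned.
    """
    output_columns = list(block.keys())  # schema the caller expects back
    working = dict(block)                # shallow copy, input stays untouched

    # Temporary expression column holding the cast child of the predicate.
    tmp_name = "__tmp_cast_" + key_column
    working[tmp_name] = [int(v) for v in working[key_column]]

    # Row filter: keep rows whose cast key is not deleted.
    keep = [v not in deleted_keys for v in working[tmp_name]]
    filtered = {
        name: [v for v, k in zip(col, keep) if k]
        for name, col in working.items()
    }

    # Cleanup: return only the original columns, in their original order.
    return {name: filtered[name] for name in output_columns}
```

A unit test in the same spirit would assert both the surviving rows and that the column list is unchanged after filtering.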
Gabriel39 added a commit that referenced this pull request on May 29, 2026
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #63893
Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Added BE UT cases in table_reader_test and parquet_reader_test.
- Ran git diff --check.
- Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17.
- Behavior changed: No
- Does this need documentation: No
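For the "row lineage after delete filtering" case, a small sketch may help show what such a test checks. It assumes the Iceberg row-lineage convention that an unset row ID is derived as the data file's first row ID plus the row's position in the file, so surviving rows must keep IDs based on their original positions, not on their positions after filtering. `apply_position_deletes` is a hypothetical helper, not a Doris function.

```python
from typing import Iterable, List, Tuple


def apply_position_deletes(
    num_rows: int, deleted_positions: Iterable[int], first_row_id: int
) -> Tuple[List[int], List[int]]:
    """Return (surviving file positions, their row IDs).

    Assumption: row ID = first_row_id + original file position.
    """
    deleted = set(deleted_positions)
    # Positions are evaluated against the unfiltered file row order.
    kept_positions = [p for p in range(num_rows) if p not in deleted]
    # Row IDs come from original positions, so deletes leave gaps.
    row_ids = [first_row_id + p for p in kept_positions]
    return kept_positions, row_ids
```

Computing IDs from the post-filter index instead would silently renumber rows, which is the kind of regression this coverage is meant to catch.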
Force-pushed from 18b74d2 to 837cc56
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Squash the refactored reader branch into one commit on top of master. The change adds the refactored TableReader/FileReader stack, the new parquet reader path, table-format readers, nested projection/filter support, aggregate pushdown support, FileScannerV2, and related BE tests and design docs.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --cached --check before committing.
- Behavior changed: Yes
- Does this need documentation: No
Force-pushed from 837cc56 to 475e48a
run buildall
TPC-H: Total hot run time: 29107 ms
FE Regression Coverage Report
Increment line coverage
TPC-DS: Total hot run time: 168765 ms
### What changed
- Simplified the file reader schema layout and documented the intent.
- Removed the parquet shape-only reader wrapper and let unprojected nested children advance through their original reader skip path.
- Refactored new parquet MAP/LIST nested assembly toward local reader-owned Dremel traversal.
- Localized MAP-only repeated assembly helpers in MapColumnReader.
- Simplified nested scalar batch state by removing values_written and omitting value_indices for dense nested leaf batches.
- Updated complex column refactor documentation with the current Phase 3/4 status.
### Why
This keeps Doris new parquet complex column handling closer to the intended reader layering: LIST owns ColumnArray assembly, MAP owns ColumnMap assembly, and shared nested helpers only keep the state that multiple readers actually need.
### Validation
- Local git diff --check.
- Fedora /home/socrates/code/doris: BUILD_TYPE=DEBUG ./build.sh --be passed after each code step.
- UT not run in this round.
### Notes
- PR target: apache/doris refact_reader_branch.
- Head branch: suxiaogang223:codex/simplify-file-reader-schema.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
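As background for "LIST owns ColumnArray assembly", the following sketch shows Dremel-style assembly of a single LIST column from repetition and definition levels. It assumes the standard three-level Parquet list layout with an optional list and optional elements (max repetition level 1, max definition level 3), and it emits Arrow-style offsets with a leading 0. `assemble_list` is a hypothetical name and the code is illustrative only, not the reader's actual C++ logic.

```python
from typing import List, Optional, Sequence, Tuple


def assemble_list(
    rep_levels: Sequence[int],
    def_levels: Sequence[int],
    values: Sequence[int],
) -> Tuple[List[int], List[bool], List[Optional[int]]]:
    """Rebuild (offsets, list_null_map, elements) for optional LIST<optional INT>.

    Definition levels under the assumed schema:
      0 = list is NULL, 1 = list is empty,
      2 = element is NULL, 3 = element is present.
    Repetition level 0 starts a new row, 1 continues the current list.
    """
    offsets: List[int] = [0]
    list_null_map: List[bool] = []
    elements: List[Optional[int]] = []
    value_iter = iter(values)  # only def level 3 consumes a stored value

    for rep, definition in zip(rep_levels, def_levels):
        if rep == 0:
            # New row: open a list that is empty so far.
            offsets.append(offsets[-1])
            list_null_map.append(definition == 0)
        if definition >= 2:
            # An element slot exists, either NULL or a real value.
            elements.append(next(value_iter) if definition == 3 else None)
            offsets[-1] += 1

    return offsets, list_null_map, elements
```

For the rows `[1, 2]`, `NULL`, `[]`, `[NULL, 3]` the levels are rep `0,1,0,0,0,1` and def `3,3,0,1,2,3` with stored values `1, 2, 3`. Keeping this traversal inside the LIST reader means the shared helpers do not need to know how offsets are built.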
…ts filtering (#64098)

## Summary
Implements complex type predicate filtering and statistics-based file-layer pruning for nested Parquet STRUCT columns, aligning with DuckDB's nested filter semantics while respecting Doris' new parquet reader architecture.

## Changes

### Row-level Expr Localization
- `struct_element(VSlotRef(parent), literal child)` chains are recognized as nested paths
- Parent slot is rewritten to file-local top-level block slot while preserving `struct_element` form
- Struct children are NOT registered as independent block slots

### Filter-only Nested Projection
- Filter-referenced struct children are merged into the same top-level complex column's `FieldProjection.children`
- Output children maintain priority order; filter-only children are appended to read projection
- Filter-only children are excluded from `ColumnMapping.child_mappings` to avoid affecting table output materialization

### Nested File-layer Pruning Target
- `FileColumnPredicateFilter` adds `file_child_id_path` for file-local child field-id paths
- AND-semantics `struct_element(...) op literal` / `IN (...)` construct pruning hints
- OR/NOT/arbitrary function subtrees are NOT extracted for pruning (safety)
- Supports renamed nested children via table-to-file field-id mapping

### Parquet Leaf Resolution & Pruning
- `ResolvePredicateLeafSchema()` resolves top-level or nested targets to primitive leaf schema
- Row group min/max statistics pruning for nested struct primitives
- Dictionary pruning for nested struct string-like columns
- Bloom filter pruning via Arrow adapter for supported primitive types
- Page index row range pruning for non-repeated primitive leaves only

### Test Coverage
- Mapper unit tests: nested predicate filters (GT, IN_LIST, reverse comparison, deep path)
- Renamed child projection via field-id mapping
- Missing child and OR subtree safety (no false pruning hints)
- Real Parquet fixture tests for statistics, dictionary, and page index pruning
- Bloom filter unit tests via Arrow adapter

### Out of Scope (intentionally)
- LIST/MAP/repeated leaf pruning
- Dynamic field names or non-deterministic expressions
- Real Parquet bloom filter fixture (Arrow writer lacks stable bloom metadata API)
- Full complex child schema change (requires FE/table reader support)

## Related

🤖 Generated with [Claude Code](https://claude.com/claude-code)
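The AND-only extraction rule and the row group min/max check can be sketched together. The expression classes (`Cmp`, `And`, `Or`, `Not`) and the functions below are hypothetical stand-ins for illustration, not the `FileColumnPredicateFilter` API; nested paths are modeled as tuples of field names and statistics as `(min, max)` pairs per path.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


@dataclass(frozen=True)
class Cmp:
    path: Path      # e.g. ("s", "a") for struct_element(s, 'a')
    op: str         # one of "<", "<=", ">", ">=", "="
    literal: int


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


@dataclass(frozen=True)
class Not:
    child: object


def extract_pruning_hints(expr: object) -> List[Cmp]:
    """Collect comparisons reachable through AND nodes only."""
    if isinstance(expr, Cmp):
        return [expr]
    if isinstance(expr, And):
        hints: List[Cmp] = []
        for child in expr.children:
            hints.extend(extract_pruning_hints(child))
        return hints
    # OR / NOT / anything else: no hint, the row filter still applies later.
    return []


def may_match(hint: Cmp, lo: int, hi: int) -> bool:
    """Could any value in [lo, hi] satisfy the comparison?"""
    if hint.op == ">":
        return hi > hint.literal
    if hint.op == ">=":
        return hi >= hint.literal
    if hint.op == "<":
        return lo < hint.literal
    if hint.op == "<=":
        return lo <= hint.literal
    if hint.op == "=":
        return lo <= hint.literal <= hi
    return True  # unknown operator: never prune


def select_row_groups(
    expr: object, row_group_stats: List[Dict[Path, Tuple[int, int]]]
) -> List[int]:
    """Return indexes of row groups that cannot be ruled out by min/max."""
    hints = extract_pruning_hints(expr)
    kept: List[int] = []
    for index, stats in enumerate(row_group_stats):
        pruned = any(
            h.path in stats and not may_match(h, *stats[h.path]) for h in hints
        )
        if not pruned:
            kept.append(index)
    return kept
```

Skipping OR and NOT subtrees is conservative: a hint taken from one branch of an OR could prune a row group that the other branch still matches, so only conjuncts are safe to use on their own. Missing statistics likewise keep the row group.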