[refactor](table) Refactor table and file reader #63893

Draft
Gabriel39 wants to merge 7 commits into master from refact_reader_branch

@Gabriel39
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Gabriel39 marked this pull request as draft May 29, 2026 06:39
Gabriel39 added a commit to Gabriel39/incubator-doris that referenced this pull request May 29, 2026
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63893

Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added BE UT cases in table_reader_test and parquet_reader_test.
    - Ran git diff --check.
    - Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17.
- Behavior changed: No
- Does this need documentation: No
Gabriel39 added a commit that referenced this pull request May 29, 2026
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #63893

Problem Summary: Add focused BE unit coverage for new table reader and
new parquet reader edge cases, including aggregate pushdown over split
ranges, Iceberg equality/position deletes, row lineage after delete
filtering, Parquet dictionary/statistics pruning, and IOContext release.
Also clean up temporary delete predicate expression columns in the new
Parquet reader so equality delete predicates with cast children do not
alter the returned file block schema.
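
The cleanup described above can be pictured with a minimal sketch. It is not the Doris implementation: the block is modeled as a plain dict of column name to values, and the names `apply_equality_delete` and `__tmp_cast_id` are hypothetical. The idea is only that a predicate with a cast child materializes a temporary column, and that column is dropped again before the block is returned.

```python
def apply_equality_delete(block, deleted_keys):
    """Filter rows whose casted id appears in deleted_keys; keep the schema unchanged."""
    original_columns = list(block.keys())

    # Hypothetical temporary column: assume the delete file stores keys as
    # strings, so the predicate needs a cast of the "id" column.
    block["__tmp_cast_id"] = [str(v) for v in block["id"]]
    keep = [v not in deleted_keys for v in block["__tmp_cast_id"]]

    for name in list(block.keys()):
        if name not in original_columns:
            # Drop temporary predicate columns so callers see the original schema.
            del block[name]
        else:
            block[name] = [v for v, k in zip(block[name], keep) if k]
    return block


block = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
result = apply_equality_delete(block, {"2"})
print(list(result.keys()), result)
```

Without the `del` step the returned block would carry an extra column, which is the schema change the commit message says it avoids.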

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added BE UT cases in table_reader_test and parquet_reader_test.
    - Ran git diff --check.
    - Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
Gabriel39 force-pushed the refact_reader_branch branch 3 times, most recently from 18b74d2 to 837cc56 Compare June 3, 2026 05:02
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Squash the refactored reader branch into one commit on top of master. The change adds the refactored TableReader/FileReader stack, the new parquet reader path, table-format readers, nested projection/filter support, aggregate pushdown support, FileScannerV2, and related BE tests and design docs.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --cached --check before committing.
- Behavior changed: Yes
- Does this need documentation: No
Gabriel39 force-pushed the refact_reader_branch branch from 837cc56 to 475e48a Compare June 3, 2026 05:14
@Gabriel39
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29107 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c7e07bf0f4367f7634e524741d26894f84d16410, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
query	cold	hot1	hot2	best hot (all times in ms)
q1	17461	4051	4019	4019
q2
q3	10688	1355	807	807
q4	4687	472	342	342
q5	7582	883	589	589
q6	187	173	139	139
q7	796	840	650	650
q8	9383	1609	1560	1560
q9	5881	4483	4507	4483
q10	6786	1831	1557	1557
q11	421	269	252	252
q12	636	422	288	288
q13	18186	3500	2734	2734
q14	264	264	253	253
q15
q16	812	779	704	704
q17	1018	1005	897	897
q18	6979	5935	5593	5593
q19	2044	1276	992	992
q20	505	386	268	268
q21	6381	2852	2672	2672
q22	475	385	308	308
Total cold run time: 101172 ms
Total hot run time: 29107 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
query	cold	hot1	hot2	best hot (all times in ms)
q1	5074	4871	4694	4694
q2
q3	4929	5301	4650	4650
q4	2133	2173	1402	1402
q5	4796	4986	4655	4655
q6	228	181	128	128
q7	1915	1752	1598	1598
q8	2429	2131	2084	2084
q9	7840	7666	7441	7441
q10	4695	4665	4246	4246
q11	530	379	352	352
q12	730	734	521	521
q13	3029	3327	2799	2799
q14	287	281	254	254
q15
q16	687	692	603	603
q17	1280	1252	1252	1252
q18	7398	6975	6771	6771
q19	1109	1094	1118	1094
q20	2214	2220	1955	1955
q21	5243	4552	4491	4491
q22	513	471	436	436
Total cold run time: 57059 ms
Total hot run time: 51426 ms
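
For readers checking the totals, here is a small sketch of how they appear to be derived from the rows: the cold total is the sum of the first timing column, and the hot total is the sum of the smaller of the two hot runs. This reading is inferred from the numbers in the report, not from the benchmark scripts, and the function name `summarize` is made up for the example.

```python
def summarize(lines):
    """Return (cold_total_ms, hot_total_ms) for benchmark result rows."""
    cold_total = 0
    hot_total = 0
    for line in lines:
        fields = line.split()
        # Only rows that start with a query name such as "q3" or "query14_1".
        if not fields or not fields[0].startswith("q"):
            continue
        timings = [int(f) for f in fields if f.isdigit()]
        if len(timings) != 4:
            continue  # a query name printed without timings
        cold, hot1, hot2, _best_hot = timings
        cold_total += cold
        hot_total += min(hot1, hot2)
    return cold_total, hot_total


# Three rows copied from Round 1; q2 has no timings of its own.
sample = [
    "q1\t17461\t4051\t4019\t4019",
    "q2\tq3\t10688\t1355\t807\t807",
    "q4\t4687\t472\t342\t342",
]
print(summarize(sample))  # (32836, 5168)
```

Applied to all Round 1 rows, the same sums give 101172 ms cold and 29107 ms hot, matching the totals printed by the bot.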

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 66.67% (2/3) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168765 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c7e07bf0f4367f7634e524741d26894f84d16410, data reload: false

query	cold	hot1	hot2	best hot (all times in ms)
query5	4336	620	484	484
query6	459	210	175	175
query7	4842	554	313	313
query8	375	211	208	208
query9	8801	4019	4022	4019
query10	434	321	274	274
query11	5941	2346	2210	2210
query12	166	104	101	101
query13	1287	678	408	408
query14	6372	5363	5054	5054
query14_1	4390	4379	4354	4354
query15	203	195	175	175
query16	1032	453	433	433
query17	1120	709	583	583
query18	2464	465	344	344
query19	202	180	136	136
query20	109	109	109	109
query21	211	135	116	116
query22	13661	13491	13333	13333
query23	17282	16449	16129	16129
query23_1	16201	16314	16343	16314
query24	7482	1728	1346	1346
query24_1	1335	1319	1306	1306
query25	589	467	411	411
query26	1359	317	178	178
query27	2601	555	319	319
query28	4489	2082	2027	2027
query29	1112	628	491	491
query30	312	235	197	197
query31	1113	1107	963	963
query32	106	62	62	62
query33	535	339	275	275
query34	1180	1194	651	651
query35	756	803	714	714
query36	1416	1408	1236	1236
query37	160	114	127	114
query38	3209	3114	3049	3049
query39	916	908	898	898
query39_1	884	884	870	870
query40	224	125	101	101
query41	64	62	61	61
query42	94	97	93	93
query43	315	321	280	280
query44	
query45	201	189	180	180
query46	1108	1190	719	719
query47	2370	2376	2229	2229
query48	401	421	303	303
query49	622	469	354	354
query50	982	356	259	259
query51	4282	4268	4232	4232
query52	87	87	81	81
query53	246	269	191	191
query54	270	227	204	204
query55	78	75	70	70
query56	242	224	215	215
query57	1459	1415	1332	1332
query58	252	216	199	199
query59	1591	1676	1425	1425
query60	285	246	234	234
query61	165	167	160	160
query62	692	642	576	576
query63	232	178	183	178
query64	2521	772	614	614
query65	
query66	1769	474	339	339
query67	29729	29711	28959	28959
query68	
query69	422	301	263	263
query70	944	988	988	988
query71	300	212	214	212
query72	2912	2878	2593	2593
query73	837	775	461	461
query74	5130	4933	4801	4801
query75	2695	2550	2225	2225
query76	2283	1160	784	784
query77	353	385	292	292
query78	12324	12452	11858	11858
query79	1456	996	795	795
query80	586	480	396	396
query81	461	287	244	244
query82	577	158	128	128
query83	353	270	261	261
query84	257	140	108	108
query85	868	537	442	442
query86	361	308	292	292
query87	3346	3309	3141	3141
query88	3658	2762	2763	2762
query89	419	374	323	323
query90	1969	182	185	182
query91	178	164	134	134
query92	65	64	60	60
query93	1555	1506	856	856
query94	567	370	309	309
query95	674	475	351	351
query96	1038	751	340	340
query97	2698	2679	2598	2598
query98	229	205	208	205
query99	1171	1180	1023	1023
Total cold run time: 250908 ms
Total hot run time: 168765 ms

suxiaogang223 and others added 3 commits June 4, 2026 10:16
### What changed

- Simplified the file reader schema layout and documented the intent.
- Removed the parquet shape-only reader wrapper and let unprojected
nested children advance through their original reader skip path.
- Refactored new parquet MAP/LIST nested assembly toward local
reader-owned Dremel traversal.
- Localized MAP-only repeated assembly helpers in MapColumnReader.
- Simplified nested scalar batch state by removing values_written and
omitting value_indices for dense nested leaf batches.
- Updated complex column refactor documentation with the current Phase
3/4 status.

### Why

This keeps Doris's new Parquet complex-column handling closer to the
intended reader layering: LIST owns ColumnArray assembly, MAP owns
ColumnMap assembly, and shared nested helpers only keep the state that
multiple readers actually need.
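
As background for the LIST side of that layering, the sketch below shows Dremel-style assembly of one list column from repetition and definition levels. It is a generic illustration, not the Doris reader code: it assumes an optional LIST with required integer elements (maximum definition level 2), uses offsets with a leading 0, and the function name `assemble_list_column` is hypothetical.

```python
def assemble_list_column(rep_levels, def_levels, values):
    """Build (offsets, nulls, elements) for an optional LIST of required ints.

    Assumed level meaning: def 0 = NULL list, def 1 = empty list,
    def 2 = one element present; rep 0 starts a new row, rep 1 continues it.
    """
    offsets = [0]
    nulls = []
    elements = []
    value_iter = iter(values)
    for rep, definition in zip(rep_levels, def_levels):
        if rep == 0:
            # A new top-level row begins; it starts where the previous one ended.
            offsets.append(offsets[-1])
            nulls.append(definition == 0)
        if definition == 2:
            elements.append(next(value_iter))
            offsets[-1] += 1
    return offsets, nulls, elements


# Rows: [1, 2], NULL, [], [3]
print(assemble_list_column([0, 1, 0, 0, 0], [2, 2, 0, 1, 2], [1, 2, 3]))
```

In this model the list reader is the only place that interprets the repeated level, which is the kind of ownership the commit message describes for ColumnArray.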

### Validation

- Local git diff --check.
- Fedora /home/socrates/code/doris: BUILD_TYPE=DEBUG ./build.sh --be
passed after each code step.
- UT not run in this round.

### Notes

- PR target: apache/doris refact_reader_branch.
- Head branch: suxiaogang223:codex/simplify-file-reader-schema.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…ts filtering (#64098)

## Summary

Implements complex type predicate filtering and statistics-based
file-layer pruning for nested Parquet STRUCT columns, aligning with
DuckDB's nested filter semantics while respecting Doris' new parquet
reader architecture.

## Changes

### Row-level Expr Localization
- `struct_element(VSlotRef(parent), literal child)` chains are
recognized as nested paths
- Parent slot is rewritten to file-local top-level block slot while
preserving `struct_element` form
- Struct children are NOT registered as independent block slots
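
A rough sketch of recognizing such a chain follows. Expressions are modeled as plain tuples, which is an assumption made for the example; the tuple tags and the function name `nested_path` are not Doris identifiers.

```python
def nested_path(expr):
    """Return (parent_slot, [child, ...]) for a struct_element chain, else None.

    Assumed expression model:
      ("slot", name)                           a slot reference
      ("struct_element", parent_expr, child)   child must be a literal string
    """
    names = []
    while expr[0] == "struct_element":
        _, parent, child = expr
        if not isinstance(child, str):
            return None  # dynamic field names are not treated as nested paths
        names.append(child)
        expr = parent
    if expr[0] != "slot" or not names:
        return None
    return expr[1], list(reversed(names))


# struct_element(struct_element(s, "a"), "b") is read as s.a.b
chain = ("struct_element", ("struct_element", ("slot", "s"), "a"), "b")
print(nested_path(chain))  # ('s', ['a', 'b'])
```

Only the root slot would need rewriting to the file-local block slot; the chain itself keeps its `struct_element` form.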

### Filter-only Nested Projection
- Filter-referenced struct children are merged into the same top-level
complex column's `FieldProjection.children`
- Output children maintain priority order; filter-only children are
appended to read projection
- Filter-only children are excluded from `ColumnMapping.child_mappings`
to avoid affecting table output materialization
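
The ordering rule above can be sketched as follows. Children are represented by name only, and `merge_projection` is a hypothetical helper; the real `FieldProjection` and `ColumnMapping` structures carry more state than shown here.

```python
def merge_projection(output_children, filter_children):
    """Return (read_children, child_mappings) for one top-level struct column."""
    # Output children come first and keep their order.
    read_children = list(output_children)
    for child in filter_children:
        # Filter-only children are appended to the read projection once.
        if child not in read_children:
            read_children.append(child)
    # Only output children are mapped to the table output.
    child_mappings = list(output_children)
    return read_children, child_mappings


print(merge_projection(["a", "c"], ["b", "a"]))  # (['a', 'c', 'b'], ['a', 'c'])
```

Here `b` is read so the filter can be evaluated, but it never appears in the mappings that drive output materialization.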

### Nested File-layer Pruning Target
- `FileColumnPredicateFilter` adds `file_child_id_path` for file-local
child field-id paths
- AND-semantics `struct_element(...) op literal` / `IN (...)` construct
pruning hints
- OR/NOT/arbitrary function subtrees are NOT extracted for pruning
(safety)
- Supports renamed nested children via table-to-file field-id mapping
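
The AND-only extraction rule can be sketched like this. The tuple-based expression model and the name `collect_pruning_hints` are assumptions for illustration, not the structure of `FileColumnPredicateFilter`.

```python
def collect_pruning_hints(expr):
    """Collect comparison and IN predicates reachable through AND nodes only.

    Assumed expression model:
      ("and", left, right), ("or", left, right), ("not", child),
      ("cmp", op, path, literal), ("in", path, [literals])
    """
    kind = expr[0]
    if kind == "and":
        return collect_pruning_hints(expr[1]) + collect_pruning_hints(expr[2])
    if kind in ("cmp", "in"):
        return [expr]
    # OR, NOT, and arbitrary functions contribute no hints.
    return []


predicate = (
    "and",
    ("cmp", ">", ("s", ["a"]), 5),
    ("or", ("cmp", "=", ("s", ["b"]), 1), ("cmp", "=", ("s", ["c"]), 2)),
)
print(collect_pruning_hints(predicate))
```

Ignoring a conjunct only weakens the pruning condition, so skipping the OR subtree can cost pruning opportunities but cannot drop rows that match the full predicate.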

### Parquet Leaf Resolution & Pruning
- `ResolvePredicateLeafSchema()` resolves top-level or nested targets to
primitive leaf schema
- Row group min/max statistics pruning for nested struct primitives
- Dictionary pruning for nested struct string-like columns
- Bloom filter pruning via Arrow adapter for supported primitive types
- Page index row range pruning for non-repeated primitive leaves only
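
For the min/max case, a minimal sketch of the skip decision for a single primitive leaf is shown below. It is the generic statistics-pruning rule, not the Doris code path; `can_skip_row_group` is a hypothetical name, and missing statistics are modeled as `None`.

```python
def can_skip_row_group(op, literal, min_value, max_value):
    """Return True only when no value in [min_value, max_value] can satisfy `leaf op literal`."""
    if min_value is None or max_value is None:
        return False  # no statistics: the row group must be read
    if op == "=":
        return literal < min_value or literal > max_value
    if op == ">":
        return max_value <= literal
    if op == ">=":
        return max_value < literal
    if op == "<":
        return min_value >= literal
    if op == "<=":
        return min_value > literal
    return False  # unknown operator: stay conservative


print(can_skip_row_group(">", 10, 1, 10))  # True
print(can_skip_row_group("=", 5, 1, 10))   # False
```

NULL values need no special handling in this rule because a comparison against NULL never evaluates to true, so a row group whose non-null range cannot match can be skipped.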

### Test Coverage
- Mapper unit tests: nested predicate filters (GT, IN_LIST, reverse
comparison, deep path)
- Renamed child projection via field-id mapping
- Missing child and OR subtree safety (no false pruning hints)
- Real Parquet fixture tests for statistics, dictionary, and page index
pruning
- Bloom filter unit tests via Arrow adapter

### Out of Scope (intentionally)
- LIST/MAP/repeated leaf pruning
- Dynamic field names or non-deterministic expressions
- Real Parquet bloom filter fixture (Arrow writer lacks stable bloom
metadata API)
- Full complex child schema change (requires FE/table reader support)

## Related

🤖 Generated with [Claude Code](https://claude.com/claude-code)
3 participants