[Fix](pyudf) Convert nested map value correctly #63907

Merged
zclllyybb merged 3 commits into apache:master from linrrzqqq:pyudf-nested-map
Jun 2, 2026
Conversation

@linrrzqqq
Collaborator

Problem Summary:

Fix Python UDF nested complex type conversion when MAP appears inside ARRAY, STRUCT, or vectorized inputs.

Previously, Python UDF argument conversion relied mostly on PyArrow's default conversions (Scalar.as_py(), Array.to_pylist(), Array.to_pandas()). Those APIs convert a top-level Arrow MAP into Python-friendly values in some paths, but expose nested MAP values as lists of tuples. For example, ARRAY<MAP<STRING, INT>> could arrive in Python as [[('a', 1)]] instead of [{'a': 1}], so user UDF code saw nested maps as list rather than dict.

This PR introduces a recursive Arrow-value conversion helper and applies it consistently across Python UDF argument conversion paths. The helper manually reconstructs Python values according to the Arrow type:

  • MAP -> dict
  • LIST / LARGE_LIST -> list
  • STRUCT -> dict
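
The conversion rules above can be illustrated with a minimal, pure-Python sketch. It is not the PR's helper and does not use PyArrow: the tuple-based type descriptors and the name `to_python` are hypothetical stand-ins for Arrow type metadata, and the raw input only mimics the list-of-tuples shape described above.

```python
# Hypothetical stand-in for Arrow type metadata; NOT PyArrow and NOT the PR's code.
# A type is a tuple: ("map", key_t, value_t), ("list", elem_t),
# ("struct", {field_name: field_t}), or ("prim",).
PRIM = ("prim",)

def to_python(value, typ):
    """Rebuild a plain Python value from a raw value, guided by its type descriptor."""
    if value is None:          # NULL stays None at every nesting level
        return None
    kind = typ[0]
    if kind == "map":          # raw form: list of (key, value) tuples -> dict
        _, key_t, val_t = typ
        return {to_python(k, key_t): to_python(v, val_t) for k, v in value}
    if kind == "list":         # convert each element with the element type
        return [to_python(v, typ[1]) for v in value]
    if kind == "struct":       # raw form: dict of field name -> raw value
        return {name: to_python(value.get(name), t) for name, t in typ[1].items()}
    return value               # primitives pass through unchanged

# Descriptor for the ARRAY<MAP<STRING, ARRAY<INT>>> shape
arr_t = ("list", ("map", PRIM, ("list", PRIM)))
raw = [[("a", [1, 2]), ("b", [3])], [("c", [4, 5, 6])]]
result = to_python(raw, arr_t)
print(result)  # [{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]
```

Driving the recursion from the declared type rather than from the runtime shape of the value is what lets a MAP be told apart from a genuine list of two-element tuples.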

Before:

CREATE FUNCTION py_deep_nested_debug(ARRAY<MAP<STRING, ARRAY<INT>>>)
RETURNS STRING
PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "evaluate",
    "runtime_version" = "3.12.11",
    "always_nullable" = "true"
)
AS $$
def evaluate(arr):
    if arr is None:
        return 'None'
    return 'outer_type={}, outer_repr={}'.format(type(arr).__name__, repr(arr))
$$;

SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}])             |
+-------------------------------------------------------------------------------+
| outer_type=list, outer_repr=[[('a', [1, 2]), ('b', [3])], [('c', [4, 5, 6])]] |
+-------------------------------------------------------------------------------+

Now:

SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}])       |
+-------------------------------------------------------------------------+
| outer_type=list, outer_repr=[{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}] |
+-------------------------------------------------------------------------+
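
To see why the representation matters to UDF authors, here is a small sketch of a hypothetical `evaluate` body (not part of the PR) that treats each array element as a dict. It is run as plain Python, outside Doris, against the two representations shown above.

```python
def evaluate(arr):
    """Sum every integer in ARRAY<MAP<STRING, ARRAY<INT>>>; assumes maps arrive as dicts."""
    if arr is None:
        return None
    return sum(n for m in arr for values in m.values() for n in values)

# Representation after the fix: nested maps are dicts.
after = [{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]
# Representation before the fix: nested maps are lists of (key, value) tuples.
before = [[('a', [1, 2]), ('b', [3])], [('c', [4, 5, 6])]]

print(evaluate(after))  # 21
try:
    evaluate(before)
except AttributeError as exc:
    # A list has no .values(), so dict-style UDF code breaks on the old shape.
    print("old representation fails:", exc)
```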

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions github-actions Bot left a comment

Summary opinion: Request changes. The PR fixes the intended nested MAP representation problem. However, the current implementation also applies per-row Python conversion to every vectorized UDF argument, including primitive pandas/list arguments that need no normalization. The new regression test also has nondeterministic-output and debuggability issues that should be fixed before merge.

Critical checkpoint conclusions:

  • Goal/test: The goal is clear and the new test covers scalar, vector-list, vector-series, and mixed vector/scalar nested MAP cases, but one expected-result query is not ordered.
  • Scope/focus: The feature change is small, but the conversion is applied more broadly than necessary and affects all vectorized UDF inputs.
  • Concurrency/lifecycle/config/compatibility/persistence/data writes: No new concurrency, lifecycle management, configs, storage format, persistence, or transaction behavior found in this PR.
  • Parallel paths: The scalar and vectorized Python UDF argument paths were both updated, but the vectorized primitive path now pays the nested conversion cost unnecessarily.
  • Error handling/observability: Existing exception propagation/logging remains unchanged; no new observability requirement found.
  • Test coverage/results: Coverage is relevant, but the new regression test violates Doris ordering and cleanup standards.
  • User focus: No additional user-provided review focus was specified.
  • Local verification: I inspected the diff and review context; I could not run a PyArrow reproduction because pyarrow is not installed in this runner environment.

Comment thread be/src/udf/python/python_server.py Outdated
Comment thread regression-test/suites/pythonudf_p0/test_pythonudf_nested_complex_type.groovy Outdated
Comment thread regression-test/suites/pythonudf_p0/test_pythonudf_nested_complex_type.groovy Outdated
@linrrzqqq
Collaborator Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 30999 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 641b6ffd0a4763ee6c067ab5d6653a2dd8700568, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17677	4017	4013	4013
q2	q3	10880	1413	797	797
q4	4680	476	350	350
q5	7656	2261	2098	2098
q6	329	172	135	135
q7	937	798	662	662
q8	9357	1815	1565	1565
q9	6874	5066	4935	4935
q10	6465	2270	1866	1866
q11	446	281	248	248
q12	682	432	297	297
q13	18268	3394	2773	2773
q14	267	255	232	232
q15	q16	822	784	709	709
q17	1027	883	1022	883
q18	6756	5700	5461	5461
q19	1222	1220	1045	1045
q20	530	388	254	254
q21	5713	2610	2373	2373
q22	437	347	303	303
Total cold run time: 101025 ms
Total hot run time: 30999 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4358	4269	4246	4246
q2	q3	4564	5064	4328	4328
q4	2098	2199	1363	1363
q5	4390	4316	4657	4316
q6	255	195	148	148
q7	2031	1817	1611	1611
q8	2484	2126	2129	2126
q9	8001	7997	7949	7949
q10	4837	4873	4341	4341
q11	568	409	378	378
q12	734	773	530	530
q13	3277	3722	3064	3064
q14	298	294	283	283
q15	q16	778	732	666	666
q17	1339	1297	1306	1297
q18	7817	7355	6875	6875
q19	1129	1085	1104	1085
q20	2232	2218	1946	1946
q21	5252	4571	4405	4405
q22	508	446	405	405
Total cold run time: 56950 ms
Total hot run time: 51362 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170853 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 641b6ffd0a4763ee6c067ab5d6653a2dd8700568, data reload: false

query5	4316	662	520	520
query6	324	218	198	198
query7	4229	572	322	322
query8	325	234	253	234
query9	8792	4010	3944	3944
query10	449	353	296	296
query11	5775	2408	2254	2254
query12	183	129	127	127
query13	1310	630	453	453
query14	6095	5471	5138	5138
query14_1	4458	4470	4430	4430
query15	215	203	193	193
query16	988	424	450	424
query17	1151	760	613	613
query18	2737	500	365	365
query19	221	204	175	175
query20	139	133	132	132
query21	218	140	115	115
query22	13665	13583	13340	13340
query23	17320	16536	16103	16103
query23_1	16379	16252	16278	16252
query24	7510	1778	1315	1315
query24_1	1304	1308	1321	1308
query25	550	471	420	420
query26	1337	346	168	168
query27	2612	551	351	351
query28	4414	1996	1987	1987
query29	975	629	499	499
query30	310	237	196	196
query31	1125	1077	952	952
query32	87	74	69	69
query33	553	350	296	296
query34	1197	1138	637	637
query35	759	782	707	707
query36	1389	1437	1216	1216
query37	152	99	93	93
query38	3229	3200	3056	3056
query39	924	930	894	894
query39_1	879	899	863	863
query40	231	145	122	122
query41	65	63	60	60
query42	110	107	106	106
query43	323	355	287	287
query44	
query45	214	198	206	198
query46	1055	1209	754	754
query47	2394	2364	2306	2306
query48	400	410	295	295
query49	618	494	378	378
query50	1024	346	252	252
query51	4356	4350	4272	4272
query52	105	105	92	92
query53	284	280	199	199
query54	315	275	262	262
query55	94	94	93	93
query56	326	311	301	301
query57	1472	1482	1367	1367
query58	301	269	272	269
query59	1655	1731	1516	1516
query60	334	323	313	313
query61	160	157	160	157
query62	706	654	588	588
query63	234	200	204	200
query64	2389	822	637	637
query65	
query66	1677	485	351	351
query67	29779	29710	29461	29461
query68	
query69	473	341	305	305
query70	998	1004	941	941
query71	304	279	266	266
query72	3053	2711	2444	2444
query73	834	765	428	428
query74	5119	4994	4759	4759
query75	2696	2607	2276	2276
query76	2297	1132	763	763
query77	416	411	330	330
query78	12325	12525	11794	11794
query79	1468	1032	754	754
query80	644	545	449	449
query81	457	285	248	248
query82	1386	165	122	122
query83	363	292	244	244
query84	261	147	112	112
query85	876	539	486	486
query86	407	355	338	338
query87	3420	3394	3235	3235
query88	3610	2772	2741	2741
query89	443	390	349	349
query90	1859	181	178	178
query91	179	184	146	146
query92	82	78	71	71
query93	1482	1564	903	903
query94	543	366	310	310
query95	687	375	364	364
query96	1014	857	315	315
query97	2729	2744	2610	2610
query98	249	233	225	225
query99	1208	1136	1038	1038
Total cold run time: 254346 ms
Total hot run time: 170853 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.84% (28129/38095)
Line Coverage 57.80% (305729/528924)
Region Coverage 55.03% (256137/465429)
Branch Coverage 56.47% (110514/195709)

@linrrzqqq
Collaborator Author

/review

@github-actions
Contributor

OpenCode automated review failed and did not complete.

Error: Error: Token refresh failed: 401
Workflow run: https://github.com/apache/doris/actions/runs/26718293427

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@linrrzqqq
Collaborator Author

/review

@github-actions
Contributor

github-actions Bot commented Jun 1, 2026

OpenCode automated review failed and did not complete.

Error: Error: Token refresh failed: 401
Workflow run: https://github.com/apache/doris/actions/runs/26732080872

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@linrrzqqq
Collaborator Author

run buildall

HappenLee previously approved these changes Jun 1, 2026
Contributor

@HappenLee HappenLee left a comment

LGTM

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 1, 2026
@github-actions
Contributor

github-actions Bot commented Jun 1, 2026

PR approved by at least one committer and no changes requested.

@hello-stephen
Contributor

TPC-H: Total hot run time: 32034 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d701b803b74a0ede1d797a5e98768e45b4a4bcb0, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17640	4155	4178	4155
q2	q3	10744	1430	825	825
q4	4682	480	348	348
q5	7595	2319	2149	2149
q6	239	185	141	141
q7	972	774	638	638
q8	9354	1802	1633	1633
q9	5213	4989	4996	4989
q10	6383	2190	1872	1872
q11	449	297	251	251
q12	632	425	294	294
q13	18111	3416	2806	2806
q14	271	254	249	249
q15	q16	824	769	714	714
q17	1003	971	986	971
q18	6934	5766	5634	5634
q19	1358	1325	1224	1224
q20	581	465	283	283
q21	6312	2994	2555	2555
q22	463	367	303	303
Total cold run time: 99760 ms
Total hot run time: 32034 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5174	5046	5050	5046
q2	q3	5006	5294	4648	4648
q4	2139	2249	1402	1402
q5	5148	4793	4712	4712
q6	251	183	134	134
q7	1889	1831	1579	1579
q8	2617	2246	2234	2234
q9	7876	7518	7459	7459
q10	4752	4677	4219	4219
q11	561	399	351	351
q12	739	742	529	529
q13	3050	3385	2771	2771
q14	274	287	268	268
q15	q16	685	689	638	638
q17	1309	1295	1279	1279
q18	7297	6858	6865	6858
q19	1180	1115	1121	1115
q20	2217	2235	1972	1972
q21	5347	4704	4580	4580
q22	527	489	418	418
Total cold run time: 58038 ms
Total hot run time: 52212 ms

@linrrzqqq
Collaborator Author

/review

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172482 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d701b803b74a0ede1d797a5e98768e45b4a4bcb0, data reload: false

query5	4311	664	523	523
query6	327	230	212	212
query7	4235	574	296	296
query8	328	241	229	229
query9	8789	4118	4104	4104
query10	443	350	305	305
query11	5799	2416	2233	2233
query12	192	132	129	129
query13	1291	626	454	454
query14	6095	5479	5247	5247
query14_1	4528	4563	4469	4469
query15	214	205	191	191
query16	983	451	448	448
query17	1109	753	595	595
query18	2465	497	360	360
query19	216	203	174	174
query20	143	135	131	131
query21	215	138	122	122
query22	13613	13649	13478	13478
query23	17461	16631	16271	16271
query23_1	16314	16343	16604	16343
query24	7461	1784	1310	1310
query24_1	1318	1351	1330	1330
query25	555	481	416	416
query26	1316	332	177	177
query27	2684	551	355	355
query28	4420	1986	1977	1977
query29	996	666	490	490
query30	319	236	197	197
query31	1144	1077	963	963
query32	91	77	73	73
query33	539	353	303	303
query34	1223	1137	662	662
query35	786	824	694	694
query36	1392	1404	1288	1288
query37	159	114	97	97
query38	3226	3202	3106	3106
query39	926	923	908	908
query39_1	911	880	879	879
query40	233	146	130	130
query41	65	65	62	62
query42	112	122	112	112
query43	352	340	311	311
query44	
query45	216	205	200	200
query46	1136	1210	761	761
query47	2387	2384	2143	2143
query48	398	409	305	305
query49	651	491	386	386
query50	1016	355	254	254
query51	4345	4340	4238	4238
query52	105	109	98	98
query53	260	282	203	203
query54	323	284	261	261
query55	97	94	87	87
query56	311	312	299	299
query57	1423	1441	1310	1310
query58	302	282	281	281
query59	1643	1673	1446	1446
query60	321	328	311	311
query61	161	159	178	159
query62	704	677	599	599
query63	248	210	218	210
query64	2491	855	702	702
query65	
query66	1764	504	382	382
query67	29796	29807	29622	29622
query68	
query69	482	359	324	324
query70	1025	1035	976	976
query71	327	280	278	278
query72	3196	2939	2534	2534
query73	883	766	445	445
query74	5162	4960	4800	4800
query75	2680	2590	2271	2271
query76	2285	1163	803	803
query77	416	418	342	342
query78	12533	12566	11863	11863
query79	1487	1034	761	761
query80	727	537	462	462
query81	461	280	243	243
query82	1353	154	121	121
query83	337	282	255	255
query84	278	143	111	111
query85	894	543	474	474
query86	397	356	345	345
query87	3441	3382	3258	3258
query88	3694	2764	2755	2755
query89	462	387	348	348
query90	1850	194	197	194
query91	180	203	139	139
query92	86	82	84	82
query93	1619	1505	831	831
query94	535	359	313	313
query95	685	504	356	356
query96	1069	775	347	347
query97	2708	2742	2610	2610
query98	237	233	235	233
query99	1170	1161	1037	1037
Total cold run time: 254931 ms
Total hot run time: 172482 ms

Contributor

@github-actions github-actions Bot left a comment

Code review completed for the latest head d701b80.

Findings: I did not find additional blocking issues beyond the already-existing inline threads, and the latest commit appears to address those known points: vector conversion is gated for non-nested Arrow types, the UNION ALL regression output is ordered, and the post-test table cleanup was removed.

Critical checkpoint conclusions:

  • Goal and coverage: The PR targets Python UDF nested MAP conversion under ARRAY/STRUCT/vectorized inputs, and the implementation plus new regression coverage exercise scalar constants, vector list, vector pandas Series, nulls, empty containers, and nested MAP inside STRUCT/MAP.
  • Scope and focus: The actual PR file list is limited to python_server.py and one regression suite/output pair; the change is focused on Arrow-to-Python conversion.
  • Concurrency: No new shared mutable state or thread lifecycle is introduced.
  • Lifecycle/static initialization: No new special lifecycle or static/global initialization concerns found.
  • Configuration: No config changes.
  • Compatibility/storage format: No persisted format or FE-BE protocol change.
  • Parallel paths: Scalar UDF, vectorized UDF, and existing UDAF/UDTF callers through convert_arrow_field_to_python are covered by the shared helper path; vectorized explicit list/Series paths are handled separately.
  • Conditional checks: The new needs_nested_python_normalization gate is appropriate to avoid applying recursive normalization on primitive/non-nested vector inputs.
  • Tests/results: New regression test output is deterministic after ordering the UNION ALL query. I did not run the regression suite in this review environment.
  • Observability: No additional observability appears necessary for this conversion helper.
  • Transaction/persistence/data writes: Not applicable.
  • FE/BE variable passing: Not applicable.
  • Performance: The known vectorized primitive regression concern is addressed by keeping the direct to_pylist/to_pandas path when the Arrow type cannot contain nested MAP values.

User focus: No additional user-provided review focus was present.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.02% (21024/38919)
Line Coverage 37.57% (199234/530346)
Region Coverage 33.85% (156089/461058)
Branch Coverage 34.83% (67891/194937)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.75% (28105/38108)
Line Coverage 57.73% (305377/528968)
Region Coverage 54.83% (255216/465458)
Branch Coverage 56.39% (110338/195663)

@github-actions github-actions Bot removed the approved Indicates a PR has been approved by one committer. label Jun 1, 2026
@linrrzqqq
Collaborator Author

run buildall

@linrrzqqq
Collaborator Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.09% (21055/38925)
Line Coverage 37.61% (199517/530434)
Region Coverage 33.89% (156314/461254)
Branch Coverage 34.85% (67945/194965)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.77% (28116/38114)
Line Coverage 57.75% (305521/529057)
Region Coverage 55.00% (256137/465681)
Branch Coverage 56.43% (110436/195715)

@hello-stephen
Contributor

TPC-H: Total hot run time: 28593 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 33cb1a26d7eb5b90aafbed2535e88be8486a9bc0, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17979	4174	3997	3997
q2	q3	10812	1355	807	807
q4	4700	474	347	347
q5	7587	864	598	598
q6	196	172	134	134
q7	802	836	647	647
q8	9894	1607	1614	1607
q9	6791	4457	4400	4400
q10	6821	1799	1518	1518
q11	436	269	252	252
q12	640	429	295	295
q13	18160	3394	2759	2759
q14	262	261	238	238
q15	q16	803	772	709	709
q17	931	863	884	863
q18	6685	5686	5499	5499
q19	1426	1317	1018	1018
q20	502	414	258	258
q21	6026	2618	2347	2347
q22	424	353	300	300
Total cold run time: 101877 ms
Total hot run time: 28593 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4320	4251	4230	4230
q2	q3	4450	4915	4259	4259
q4	2075	2218	1374	1374
q5	4415	4284	4292	4284
q6	228	178	129	129
q7	1730	1629	1983	1629
q8	2633	2228	2122	2122
q9	7969	7972	8099	7972
q10	4791	4792	4372	4372
q11	569	409	402	402
q12	763	764	540	540
q13	3426	3675	3063	3063
q14	287	326	267	267
q15	q16	707	770	667	667
q17	1370	1336	1358	1336
q18	7973	7624	7167	7167
q19	1119	1127	1116	1116
q20	2219	2214	1961	1961
q21	5253	4576	4444	4444
q22	516	461	418	418
Total cold run time: 56813 ms
Total hot run time: 51752 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171397 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 33cb1a26d7eb5b90aafbed2535e88be8486a9bc0, data reload: false

query5	4322	663	531	531
query6	329	219	204	204
query7	4218	583	314	314
query8	327	243	218	218
query9	8806	4080	4098	4080
query10	460	344	311	311
query11	5692	2392	2171	2171
query12	185	135	132	132
query13	1312	631	432	432
query14	6190	5502	5186	5186
query14_1	4516	4461	4394	4394
query15	212	207	186	186
query16	991	455	435	435
query17	1135	733	598	598
query18	2530	486	351	351
query19	223	200	171	171
query20	141	133	129	129
query21	216	143	119	119
query22	13680	13560	13353	13353
query23	17348	16502	16204	16204
query23_1	16346	16247	16399	16247
query24	7482	1766	1345	1345
query24_1	1340	1318	1332	1318
query25	596	523	469	469
query26	1329	356	184	184
query27	2637	565	357	357
query28	4467	2062	2004	2004
query29	1023	656	527	527
query30	313	239	209	209
query31	1154	1091	967	967
query32	94	78	76	76
query33	562	364	307	307
query34	1196	1143	677	677
query35	783	804	715	715
query36	1368	1403	1217	1217
query37	160	112	100	100
query38	3229	3193	3100	3100
query39	933	914	897	897
query39_1	876	878	888	878
query40	236	153	134	134
query41	72	73	70	70
query42	118	115	115	115
query43	340	340	305	305
query44	
query45	223	210	202	202
query46	1080	1242	754	754
query47	2393	2355	2216	2216
query48	410	435	314	314
query49	652	530	411	411
query50	1038	362	278	278
query51	4395	4302	4312	4302
query52	110	115	101	101
query53	260	304	220	220
query54	337	310	291	291
query55	101	93	92	92
query56	327	331	323	323
query57	1421	1437	1319	1319
query58	322	299	288	288
query59	1573	1709	1445	1445
query60	340	346	339	339
query61	222	219	160	160
query62	706	665	587	587
query63	247	210	206	206
query64	2383	806	633	633
query65	
query66	1700	483	355	355
query67	29902	29726	29615	29615
query68	
query69	477	345	316	316
query70	1084	1077	1029	1029
query71	305	290	274	274
query72	3033	2710	2397	2397
query73	873	765	442	442
query74	5089	4956	4801	4801
query75	2690	2607	2264	2264
query76	2283	1152	773	773
query77	402	410	340	340
query78	12438	12449	11830	11830
query79	1459	1070	768	768
query80	1146	539	478	478
query81	502	279	251	251
query82	1366	154	128	128
query83	353	277	263	263
query84	265	142	110	110
query85	919	531	459	459
query86	449	341	312	312
query87	3414	3371	3239	3239
query88	3663	2779	2775	2775
query89	455	396	342	342
query90	1791	182	194	182
query91	178	169	138	138
query92	77	77	76	76
query93	1595	1509	937	937
query94	646	366	308	308
query95	683	474	337	337
query96	1058	769	343	343
query97	2699	2707	2643	2643
query98	239	234	237	234
query99	1145	1168	1011	1011
Total cold run time: 255453 ms
Total hot run time: 171397 ms

@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions github-actions Bot left a comment

Automated review summary:

No additional blocking issues found in the current PR head.

Critical checkpoint conclusions:

  • Goal and proof: The PR fixes nested MAP conversion for scalar and vectorized Python UDF inputs, and adds a regression case covering scalar constants, vectorized list inputs, vectorized pandas Series inputs, and mixed vector/scalar arguments.
  • Scope: The effective GitHub PR diff is focused on be/src/udf/python/python_server.py plus the new regression test and expected output.
  • Concurrency/lifecycle: No new shared state, threads, locks, static initialization, or lifecycle-sensitive ownership was introduced.
  • Configuration/compatibility: No new configuration, storage format, Thrift/protocol, or persisted metadata changes.
  • Parallel paths: Scalar conversion, vectorized list conversion, vectorized pandas conversion, and vectorized mixed scalar conversion were considered. The current head keeps primitive vectorized inputs on the original fast path and normalizes only map-containing nested types.
  • Error handling/data correctness: The recursive conversion preserves NULL handling and converts MAP/LIST/STRUCT according to Arrow type metadata. I did not find a distinct data-correctness issue beyond the already-known prior review threads.
  • Tests: The new regression output is deterministic in the current head; table cleanup now follows the Doris regression convention. I did not run the regression suite in this review environment.
  • Observability/performance: No new observability requirement. The current head avoids the earlier performance regression by gating recursive normalization to types that can contain nested MAP values.

User focus: No additional user-provided review focus was specified.

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.73% (28100/38114)
Line Coverage 57.69% (305213/529057)
Region Coverage 54.96% (255937/465681)
Branch Coverage 56.36% (110304/195715)

Contributor

@HappenLee HappenLee left a comment

LGTM

@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

PR approved by at least one committer and no changes requested.

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 2, 2026
@zclllyybb zclllyybb merged commit d073c95 into apache:master Jun 2, 2026
31 of 32 checks passed
@linrrzqqq linrrzqqq deleted the pyudf-nested-map branch June 2, 2026 03:04
linrrzqqq added a commit to linrrzqqq/doris that referenced this pull request Jun 2, 2026
zhaorongsheng pushed a commit to zhaorongsheng/doris that referenced this pull request Jun 4, 2026
Labels

approved Indicates a PR has been approved by one committer.

4 participants