[fix](be) Fix variant inverted-index cast pushdown for int and boolean#63118

Closed
wuguowei1994 wants to merge 1 commit into apache:master from wuguowei1994:fix-variant-inverted-index-cast

@wuguowei1994 wuguowei1994 commented May 10, 2026

Summary

On current master, inverted index pushdown is not correctly applied for some casted VARIANT predicates.

This PR focuses on two production-critical patterns:

  1. CAST(v["int_key"] AS INT) = <value>
  2. CAST(v["bool_key"] AS BOOLEAN) (equivalent to = true in filter context)

In our production workloads, VARIANT is heavily used and teams are required to query subfields with explicit CAST.
When these predicates cannot use inverted index filtering, query latency and resource usage increase significantly.


Business Context

In our production workloads, each business table may contain a large number of dynamic JSON keys (often hundreds of keys over time).

Because of this, it is not feasible to predefine typed paths for every subkey.

Under this constraint, the common usage pattern across teams is:

  • define a generic VARIANT column plus inverted index on the whole VARIANT:
    v VARIANT,
    INDEX idx_v(v) USING INVERTED
  • continuously write records with different dynamic keys into v
  • query subfields with explicit CAST (for example CAST(v["int_key"] AS INT))

The issue hits exactly this mainstream pattern: when casted subfield predicates cannot use inverted-index pushdown, queries on large datasets degrade to scanning nearly every row, which significantly increases latency and resource usage in production.


Reproduction

Case 1: INT cast equality

DROP TABLE IF EXISTS variant_inverted_intkey_test;

CREATE TABLE variant_inverted_intkey_test (
    row_id BIGINT,
    v VARIANT,
    INDEX idx_v(v) USING INVERTED
)
ENGINE=OLAP
DUPLICATE KEY(row_id)
DISTRIBUTED BY HASH(row_id) BUCKETS 1
PROPERTIES (
    "replication_num" = "1",
    "disable_auto_compaction" = "true",
    "inverted_index_storage_format" = "v2"
);

INSERT INTO variant_inverted_intkey_test VALUES
(1,  '{"int_key": 1}'),
(2,  '{"int_key": 2}'),
(3,  '{"int_key": 3}'),
(4,  '{"int_key": 4}'),
(5,  '{"int_key": 5}'),
(6,  '{"int_key": 6}'),
(7,  '{"int_key": 7}'),
(8,  '{"int_key": 8}'),
(9,  '{"int_key": 9}'),
(10, '{"int_key": 10}'),
(11, '{"int_key": 11}'),
(12, '{"int_key": 12}'),
(13, '{"int_key": 13}'),
(14, '{"int_key": 14}'),
(15, '{"int_key": 15}'),
(16, '{"int_key": 16}'),
(17, '{"int_key": 17}'),
(18, '{"int_key": 18}'),
(19, '{"int_key": 19}'),
(20, '{"int_key": 20}');

SELECT row_id, CAST(v["int_key"] AS INT) AS int_key
FROM variant_inverted_intkey_test
WHERE CAST(v["int_key"] AS INT) = 13;

Case 2: BOOLEAN cast predicate

DROP TABLE IF EXISTS variant_inverted_boolkey_test;

CREATE TABLE variant_inverted_boolkey_test (
    row_id BIGINT,
    v VARIANT,
    INDEX idx_v(v) USING INVERTED
)
ENGINE=OLAP
DUPLICATE KEY(row_id)
DISTRIBUTED BY HASH(row_id) BUCKETS 1
PROPERTIES (
    "replication_num" = "1",
    "disable_auto_compaction" = "true",
    "inverted_index_storage_format" = "v2"
);

INSERT INTO variant_inverted_boolkey_test VALUES
(1, '{"bool_key": true}'),
(2, '{"bool_key": false}');

SELECT row_id
FROM variant_inverted_boolkey_test
WHERE CAST(v["bool_key"] AS BOOLEAN);

Expected Behavior

  • For CAST(v["int_key"] AS INT) = 13, the predicate should be pushed down and filtered by the inverted index.
  • For CAST(v["bool_key"] AS BOOLEAN), the predicate should also be transformed into an index-evaluable form and pushed down.
  • The query profile should show effective index filtering (fewer scanned rows).

Actual Behavior (before this PR)

  • Query results are correct, but index filtering is not effectively applied.
  • Profile shows no effective inverted index filtering for these casted VARIANT predicates (for example RowsInvertedIndexFiltered = 0, scanned rows remain high).
  • This causes unnecessary row scanning and performance degradation.

Root Cause

  • For CAST(v["int_key"] AS INT) = ..., the query literal type is INT, while the auto-inferred storage type for integer subkeys is typically BIGINT.
  • In current master, this type mismatch is not safely normalized before the inverted index is probed, so index evaluation is skipped and the query falls back to row scanning.
  • For CAST(v["bool_key"] AS BOOLEAN) in filter context, current master does not build an index-evaluable predicate path for this cast form, so pushdown is also missed.

What this PR fixes

  • Enables inverted-index pushdown for casted VARIANT subcolumn predicates where the query literal can be safely normalized to the segment storage type.
  • Supports the production-critical cases:
    • CAST(v["int_key"] AS INT) = <value>
    • CAST(v["bool_key"] AS BOOLEAN) in filter context
  • Uses value-level conversion plus round-trip validation before probing the index, so lossy or out-of-range casts are skipped and fall back to normal scan.
  • Adds regression and unit coverage for positive pushdown and unsafe-conversion fallback cases.
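
The "value-level conversion plus round-trip validation" idea can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name to_storage_value is hypothetical, it handles signed integers only, and it assumes the literal and the storage type are both plain integer types.

```cpp
#include <cstdint>
#include <optional>
#include <type_traits>

// Hypothetical helper (not the Doris implementation): convert a query literal
// to the storage type and accept it only if converting back yields the
// original value. A lossy conversion returns std::nullopt, which the caller
// would treat as "skip the index and fall back to a normal scan".
template <typename To, typename From>
std::optional<To> to_storage_value(From literal) {
    static_assert(std::is_integral_v<To> && std::is_integral_v<From>,
                  "sketch covers integer types only");
    static_assert(std::is_signed_v<To> && std::is_signed_v<From>,
                  "sketch covers signed types only");
    // Narrowing a signed integer wraps (modular) on mainstream compilers and
    // is never undefined behavior, so the round trip below is a valid check.
    const To converted = static_cast<To>(literal);
    if (static_cast<From>(converted) != literal) {
        return std::nullopt;  // out of range for the storage type
    }
    return converted;
}
```

Under these assumptions, an INT literal 13 widens to a BIGINT storage value without loss, while a BIGINT literal such as 5000000000 does not survive a round trip through a 32-bit storage type and is rejected.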

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@wuguowei1994 wuguowei1994 changed the title from "[fix](variant) VARIANT Inverted Index Predicate Pushdown Bug" to "[fix](variant) allow inverted index pushdown for cast predicates on variant subcolumns" on May 10, 2026
@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch from e75111a to 904d4c0 on May 10, 2026 13:04
@eldenmoon
Member

run buildall

@eldenmoon
Member

/review

Member

@eldenmoon eldenmoon left a comment

I found a correctness blocker in the relaxed variant predicate compatibility check. The current regression covers only the same-width CAST(... AS INT) case, but this change also enables cross-width integer casts and same-family string casts without normalizing the predicate value to the segment storage encoding.
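
To illustrate why the predicate value must be normalized to the segment storage encoding, here is a small sketch. It assumes a hypothetical fixed-width big-endian term encoding; the actual index encoding used by Doris is not shown in this thread, so encode_fixed_width is purely illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical encoding (an assumption, not the Doris format): the value is
// written as big-endian bytes at the width of its own type.
template <typename T>
std::string encode_fixed_width(T value) {
    static_assert(std::is_integral_v<T>, "integer types only");
    using U = std::make_unsigned_t<T>;
    U bits = static_cast<U>(value);
    std::string out(sizeof(T), '\0');
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        // Fill from the least significant byte backwards.
        out[sizeof(T) - 1 - i] = static_cast<char>(bits & 0xFF);
        bits = static_cast<U>(bits >> 8);
    }
    return out;
}
```

With such an encoding, the 4-byte key for an INT 13 differs from the 8-byte key for a BIGINT 13, so probing a BIGINT-encoded index with an INT-encoded literal would find nothing even though the values are equal. Widening the literal to the storage type first makes the two keys identical.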

Comment thread be/src/storage/segment/segment.h Outdated
@hello-stephen
Contributor

TPC-H: Total hot run time: 29643 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 904d4c0549574f24d129b8dfb7f4d588b645f43e, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17613	3976	3964	3964
q2	q3	10719	948	611	611
q4	4659	460	356	356
q5	7447	1381	1137	1137
q6	195	179	140	140
q7	930	946	749	749
q8	9315	1364	1281	1281
q9	5637	5430	5335	5335
q10	6315	2098	1831	1831
q11	472	266	253	253
q12	651	415	288	288
q13	18162	3433	2740	2740
q14	290	284	262	262
q15	q16	913	878	785	785
q17	986	1108	770	770
q18	6513	5674	5598	5598
q19	1166	1286	1106	1106
q20	533	401	281	281
q21	4546	2297	1850	1850
q22	422	357	306	306
Total cold run time: 97484 ms
Total hot run time: 29643 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4184	4184	4127	4127
q2	q3	4620	4749	4177	4177
q4	2087	2175	1386	1386
q5	4975	4918	5209	4918
q6	188	165	133	133
q7	2022	1778	2016	1778
q8	3577	3321	3274	3274
q9	8454	8571	8576	8571
q10	4657	4591	4280	4280
q11	608	440	406	406
q12	698	771	519	519
q13	3522	3640	2891	2891
q14	298	304	291	291
q15	q16	805	805	676	676
q17	1369	1336	1284	1284
q18	7936	7102	7097	7097
q19	1202	1188	1163	1163
q20	2252	2211	1960	1960
q21	6142	5446	5411	5411
q22	717	548	415	415
Total cold run time: 60313 ms
Total hot run time: 54757 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170611 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 904d4c0549574f24d129b8dfb7f4d588b645f43e, data reload: false

query5	4333	652	535	535
query6	344	234	212	212
query7	4248	572	311	311
query8	340	250	222	222
query9	8867	4101	4098	4098
query10	463	353	302	302
query11	5831	2402	2183	2183
query12	185	139	129	129
query13	1280	608	416	416
query14	6069	5383	5072	5072
query14_1	4390	4370	4383	4370
query15	231	211	182	182
query16	1036	441	480	441
query17	1165	792	665	665
query18	2752	498	367	367
query19	235	209	181	181
query20	147	134	133	133
query21	217	139	119	119
query22	13635	14021	14420	14021
query23	17436	16492	16296	16296
query23_1	16347	16315	16302	16302
query24	7437	1819	1322	1322
query24_1	1357	1325	1357	1325
query25	565	476	429	429
query26	1310	318	174	174
query27	2689	599	344	344
query28	4308	1937	1948	1937
query29	999	616	519	519
query30	291	228	192	192
query31	1099	1039	934	934
query32	81	74	72	72
query33	543	334	288	288
query34	1190	1136	649	649
query35	753	792	663	663
query36	1306	1369	1204	1204
query37	148	105	87	87
query38	3209	3113	3056	3056
query39	970	911	902	902
query39_1	882	875	874	874
query40	232	149	136	136
query41	66	60	61	60
query42	107	105	106	105
query43	320	323	285	285
query44	
query45	210	202	185	185
query46	1059	1156	732	732
query47	2317	2284	2227	2227
query48	392	393	290	290
query49	631	535	432	432
query50	694	283	213	213
query51	4253	4208	4195	4195
query52	104	102	99	99
query53	243	278	204	204
query54	310	274	254	254
query55	91	89	85	85
query56	307	297	313	297
query57	1422	1376	1306	1306
query58	296	270	262	262
query59	1538	1555	1402	1402
query60	340	323	328	323
query61	167	159	168	159
query62	666	612	560	560
query63	239	195	206	195
query64	2404	808	684	684
query65	
query66	1689	502	398	398
query67	30071	29309	29172	29172
query68	
query69	468	334	302	302
query70	1028	1015	973	973
query71	323	280	270	270
query72	2922	2700	2478	2478
query73	851	758	446	446
query74	5033	4853	4755	4755
query75	2770	2655	2326	2326
query76	2296	1148	749	749
query77	404	425	349	349
query78	13015	13005	12308	12308
query79	1407	1029	761	761
query80	792	586	484	484
query81	467	276	235	235
query82	1300	160	127	127
query83	337	281	249	249
query84	256	143	110	110
query85	888	534	444	444
query86	428	334	310	310
query87	3441	3346	3234	3234
query88	3504	2652	2629	2629
query89	439	380	340	340
query90	1836	180	172	172
query91	189	166	142	142
query92	81	75	69	69
query93	973	958	549	549
query94	602	338	325	325
query95	690	384	343	343
query96	1076	802	338	338
query97	2698	2678	2549	2549
query98	245	229	228	228
query99	1127	1118	958	958
Total cold run time: 253780 ms
Total hot run time: 170611 ms

@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch 2 times, most recently from d4d86b8 to 6c2b533 on May 11, 2026 15:26
Contributor

@github-actions github-actions Bot left a comment

Thanks for updating the patch to avoid unsafe cross-width index encoding. I still think this needs changes because the latest exact-type rule no longer fixes the original CAST(v["int_key"] AS INT) scenario described in the PR body. Critical checkpoints: goal/test coverage is not satisfied because only same-width TINYINT pushdown is now proven while the documented INT reproduction remains non-pushdown; the code change is small and focused; no new concurrency, lifecycle, config, persistence, FE-BE protocol, or storage-format compatibility concerns were introduced; the main correctness risk is now an incomplete fix rather than wrong-result pushdown; observability is unchanged and adequate for this path through the existing debug/profile checks. User focus: no additional user-provided focus was specified.

Comment thread be/src/storage/segment/segment.h Outdated
@eldenmoon
Member

Currently, BIGINT is the only integer type that will be inferred.

@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch from 6c2b533 to 3096cdc on May 17, 2026 03:00
@wuguowei1994 wuguowei1994 requested a review from airborne12 as a code owner May 17, 2026 03:00
@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch from 3096cdc to d397fa1 on May 17, 2026 04:11
@hello-stephen
Contributor

TPC-H: Total hot run time: 31829 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f179c2575d9cb55551a08ccbd45b7e0958c26049, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17667	3927	3879	3879
q2	q3	10838	1421	815	815
q4	4683	478	348	348
q5	7582	2266	2140	2140
q6	246	178	139	139
q7	997	791	634	634
q8	9415	1716	1646	1646
q9	5175	4967	4915	4915
q10	6445	2068	1796	1796
q11	450	271	255	255
q12	641	435	286	286
q13	18132	3357	2780	2780
q14	263	253	238	238
q15	q16	814	771	705	705
q17	1013	1020	1018	1018
q18	6911	5797	5724	5724
q19	1315	1292	1116	1116
q20	680	453	295	295
q21	6004	2761	2726	2726
q22	453	374	423	374
Total cold run time: 99724 ms
Total hot run time: 31829 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4613	4534	4464	4464
q2	q3	4860	5297	4561	4561
q4	2127	2193	1380	1380
q5	4928	4686	4654	4654
q6	243	183	140	140
q7	1910	1723	1509	1509
q8	2382	2067	2077	2067
q9	7735	7184	7185	7184
q10	4459	4423	3970	3970
q11	524	374	350	350
q12	697	712	513	513
q13	2972	3391	2817	2817
q14	267	270	247	247
q15	q16	687	690	595	595
q17	1258	1246	1228	1228
q18	7435	6913	6686	6686
q19	1139	1101	1193	1101
q20	2224	2211	1947	1947
q21	5325	4575	4421	4421
q22	544	460	407	407
Total cold run time: 56329 ms
Total hot run time: 50241 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169281 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f179c2575d9cb55551a08ccbd45b7e0958c26049, data reload: false

query5	4333	656	516	516
query6	325	212	218	212
query7	4237	577	301	301
query8	323	231	213	213
query9	8846	4008	3982	3982
query10	454	369	302	302
query11	5774	2391	2131	2131
query12	180	132	126	126
query13	1276	577	424	424
query14	5883	5294	5020	5020
query14_1	4312	4292	4305	4292
query15	207	201	180	180
query16	992	469	427	427
query17	962	701	568	568
query18	2443	501	357	357
query19	213	207	175	175
query20	135	133	132	132
query21	216	139	127	127
query22	13665	13487	13329	13329
query23	17171	16449	16051	16051
query23_1	16080	16221	16179	16179
query24	7700	1788	1322	1322
query24_1	1288	1294	1321	1294
query25	581	539	440	440
query26	1344	315	171	171
query27	2717	560	349	349
query28	4548	1993	2000	1993
query29	974	599	485	485
query30	301	244	196	196
query31	1106	1053	942	942
query32	95	75	72	72
query33	537	339	285	285
query34	1175	1095	617	617
query35	741	776	662	662
query36	1332	1332	1228	1228
query37	152	101	93	93
query38	3193	3103	3036	3036
query39	915	912	882	882
query39_1	875	889	881	881
query40	232	148	126	126
query41	68	64	63	63
query42	117	110	107	107
query43	329	329	281	281
query44	
query45	207	202	196	196
query46	1069	1169	728	728
query47	2331	2383	2237	2237
query48	358	432	301	301
query49	620	484	373	373
query50	1033	359	267	267
query51	4337	4262	4203	4203
query52	108	104	93	93
query53	257	282	207	207
query54	311	269	256	256
query55	92	99	88	88
query56	299	318	306	306
query57	1401	1417	1313	1313
query58	299	272	271	271
query59	1614	1647	1388	1388
query60	319	324	302	302
query61	160	157	158	157
query62	665	632	565	565
query63	245	202	200	200
query64	2417	819	705	705
query65	
query66	1749	493	368	368
query67	29997	30019	29867	29867
query68	
query69	472	363	334	334
query70	1038	983	1002	983
query71	305	287	273	273
query72	3236	2956	2449	2449
query73	854	737	435	435
query74	5044	4876	4739	4739
query75	2623	2577	2253	2253
query76	2268	1147	798	798
query77	399	406	334	334
query78	12092	12155	11547	11547
query79	1462	1025	739	739
query80	763	532	434	434
query81	472	277	250	250
query82	1365	156	121	121
query83	366	274	242	242
query84	300	145	113	113
query85	910	519	446	446
query86	436	343	305	305
query87	3389	3384	3193	3193
query88	3503	2654	2648	2648
query89	442	382	338	338
query90	1867	178	179	178
query91	179	166	141	141
query92	74	77	74	74
query93	1613	1461	836	836
query94	592	333	314	314
query95	673	376	347	347
query96	1077	783	349	349
query97	2703	2666	2560	2560
query98	234	232	231	231
query99	1131	1114	984	984
Total cold run time: 253546 ms
Total hot run time: 169281 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/116) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.48% (20640/38593)
Line Coverage 37.14% (195065/525166)
Region Coverage 33.50% (152614/455521)
Branch Coverage 34.54% (66563/192685)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 27.59% (32/116) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 63.28% (23915/37793)
Line Coverage 46.99% (246150/523789)
Region Coverage 43.95% (202153/459936)
Branch Coverage 45.21% (87442/193415)

@wuguowei1994 wuguowei1994 changed the title from "[fix](variant) allow inverted index pushdown for cast predicates on variant subcolumns" to "[fix](variant) Support safe widening cast pushdown for variant inverted indexes" on May 17, 2026
@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch 3 times, most recently from 37a0544 to 571e6dd on May 17, 2026 14:55
@hello-stephen
Contributor

TPC-H: Total hot run time: 31058 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 571e6ddf8482971cb3006dc9e6e6324ddd22e21a, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17620	3814	3864	3814
q2	q3	10765	1449	802	802
q4	4718	488	352	352
q5	8260	2297	2149	2149
q6	323	175	139	139
q7	975	798	626	626
q8	9359	1705	1562	1562
q9	6852	4938	4896	4896
q10	6428	2154	1824	1824
q11	438	278	244	244
q12	693	417	288	288
q13	18253	3440	2805	2805
q14	268	253	239	239
q15	q16	821	767	712	712
q17	999	925	913	913
q18	6986	5880	5661	5661
q19	1195	1311	1093	1093
q20	514	404	265	265
q21	5784	2576	2375	2375
q22	444	362	299	299
Total cold run time: 101695 ms
Total hot run time: 31058 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4197	4126	4110	4110
q2	q3	4503	4944	4367	4367
q4	2133	2235	1411	1411
q5	4435	4292	4672	4292
q6	254	216	157	157
q7	1991	1795	1619	1619
q8	2463	2108	2273	2108
q9	7843	7898	7745	7745
q10	4579	4479	4377	4377
q11	592	409	372	372
q12	722	739	525	525
q13	3303	3654	3064	3064
q14	311	299	259	259
q15	q16	717	749	659	659
q17	1376	1333	1322	1322
q18	7815	7317	6868	6868
q19	1100	1079	1081	1079
q20	2240	2221	1945	1945
q21	5342	4703	4456	4456
q22	535	471	405	405
Total cold run time: 56451 ms
Total hot run time: 51140 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169495 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 571e6ddf8482971cb3006dc9e6e6324ddd22e21a, data reload: false

query5	4321	650	515	515
query6	331	227	201	201
query7	4275	548	313	313
query8	338	230	216	216
query9	8817	4117	4077	4077
query10	454	354	291	291
query11	5815	2361	2170	2170
query12	180	131	130	130
query13	1272	570	436	436
query14	6026	5406	5072	5072
query14_1	4380	4385	4388	4385
query15	219	207	188	188
query16	1042	467	395	395
query17	1177	706	583	583
query18	2727	491	352	352
query19	217	200	155	155
query20	133	137	129	129
query21	207	143	116	116
query22	13565	13598	13461	13461
query23	17293	16427	16037	16037
query23_1	16232	16150	16293	16150
query24	7430	1762	1299	1299
query24_1	1317	1289	1296	1289
query25	552	476	442	442
query26	1329	322	176	176
query27	2670	587	329	329
query28	4477	1934	1976	1934
query29	1000	628	484	484
query30	307	239	190	190
query31	1112	1074	950	950
query32	89	77	75	75
query33	529	346	296	296
query34	1159	1098	648	648
query35	774	773	668	668
query36	1345	1338	1138	1138
query37	156	105	91	91
query38	3226	3161	3047	3047
query39	951	943	902	902
query39_1	874	871	863	863
query40	228	149	125	125
query41	66	65	64	64
query42	112	115	110	110
query43	328	334	300	300
query44	
query45	221	203	201	201
query46	1089	1180	709	709
query47	2295	2308	2183	2183
query48	407	423	301	301
query49	656	509	417	417
query50	1042	363	257	257
query51	4406	4226	4294	4226
query52	110	108	101	101
query53	266	286	215	215
query54	333	285	272	272
query55	100	96	89	89
query56	324	335	316	316
query57	1410	1402	1262	1262
query58	306	290	278	278
query59	1606	1673	1459	1459
query60	360	336	325	325
query61	181	175	178	175
query62	680	625	559	559
query63	258	207	216	207
query64	2449	896	702	702
query65	
query66	1689	536	357	357
query67	30112	30036	29923	29923
query68	
query69	438	331	308	308
query70	1049	973	955	955
query71	298	274	262	262
query72	2954	2750	2378	2378
query73	848	719	427	427
query74	5129	4917	4742	4742
query75	2687	2612	2264	2264
query76	2305	1142	742	742
query77	406	417	339	339
query78	12270	12129	11631	11631
query79	1448	1001	757	757
query80	1323	546	448	448
query81	508	275	238	238
query82	1350	157	120	120
query83	355	278	244	244
query84	260	137	113	113
query85	935	542	453	453
query86	453	340	307	307
query87	3443	3360	3215	3215
query88	3503	2641	2630	2630
query89	454	386	339	339
query90	1787	197	186	186
query91	179	167	141	141
query92	78	78	77	77
query93	1519	1566	834	834
query94	673	361	313	313
query95	655	382	362	362
query96	1020	771	320	320
query97	2714	2680	2547	2547
query98	241	229	229	229
query99	1115	1112	1010	1010
Total cold run time: 254997 ms
Total hot run time: 169495 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 38.32% (41/107) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.43% (21703/37792)
Line Coverage 40.69% (213108/523794)
Region Coverage 37.02% (170250/459925)
Branch Coverage 37.87% (73240/193418)

wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request May 18, 2026
### What problem does this PR solve?

Related PR: apache#63118

Problem Summary: Variant subcolumn predicates such as
cast(v["int_key"] as bigint) IN (...) and
cast(v["float_key"] as double) = 1.5 were not being pushed down to the
inverted index. The FE plan wraps these as nested casts of the form
CAST(CAST(slot(v) AS storage_dtype) AS user_target), but three BE code
paths only accepted a single-level CAST(slot):

- _filter_and_collect_cast_type_for_variant() returned early when the
  outer cast's child was not SLOT_REF, so it never recorded the user
  target type and the slot kept its original VARIANT value range.
- is_valid_push_down_cast() required children[0]->children().at(0) to
  be a slot ref, so _is_predicate_acting_on_slot() rejected nested
  casts and skipped column-predicate construction.
- _evaluate_inverted_index() in the common-expr-pushdown path also
  required cast_expr->get_child(0)->is_slot_ref(), so the second-pass
  index probe could not peel the cast either.

This change peels the whole cast chain via VExpr::expr_without_cast()
in all three places, while still using the outermost cast's target
type for compatibility / round-trip checks. After the fix the column
predicate path constructs widened ColumnValueRange / ColumnPredicate
correctly, and convert_to_storage_value() normalizes each literal back
to the storage type before probing.
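
A minimal sketch of the cast-peeling idea described above, using a toy expression tree. The Expr struct and the functions peel_casts and is_cast_chain_over_slot are hypothetical stand-ins written for illustration; they are not the Doris VExpr API, and VExpr::expr_without_cast() may differ in detail.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression node (hypothetical, for illustration only).
enum class Kind { SlotRef, Cast, Other };

struct Expr {
    Kind kind;
    std::string target_type;  // meaningful only when kind == Kind::Cast
    std::vector<std::shared_ptr<Expr>> children;
};

std::shared_ptr<Expr> make_leaf(Kind kind) {
    return std::make_shared<Expr>(Expr{kind, "", {}});
}

std::shared_ptr<Expr> make_cast(std::string target, std::shared_ptr<Expr> child) {
    return std::make_shared<Expr>(Expr{Kind::Cast, std::move(target), {std::move(child)}});
}

// Walk down through every nested cast and return the first non-cast node.
const Expr* peel_casts(const Expr* e) {
    while (e != nullptr && e->kind == Kind::Cast && !e->children.empty()) {
        e = e->children.front().get();
    }
    return e;
}

// Accept CAST(slot) as well as CAST(CAST(slot ...) ...): the leaf under the
// whole chain must be a slot reference. The outermost cast's target_type is
// still the one a caller would use for compatibility checks.
bool is_cast_chain_over_slot(const Expr& e) {
    if (e.kind != Kind::Cast) {
        return false;
    }
    const Expr* leaf = peel_casts(&e);
    return leaf != nullptr && leaf->kind == Kind::SlotRef;
}
```

For a nested expression such as make_cast("BIGINT", make_cast("INT", make_leaf(Kind::SlotRef))), a check that only looks one level down would see a cast rather than a slot, whereas peeling the chain finds the slot while the outer node still reports BIGINT as the user target type.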

### Release note

None

### Check List (For Author)

- Test: No need to test (covered by regression tests in apache#63118 once
  the test expectations are aligned in the follow-up commit)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Cursor <cursoragent@cursor.com>
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request May 18, 2026
…_without_cast

### What problem does this PR solve?

Related PR: apache#63118

Problem Summary: The original regression-test/suites/inverted_index_p0/
test_variant_inverted_index_cast.groovy had three issues:

- The IndexFilter section and HitRows counters appear in the profile as
  soon as the segment iterator initializes its inverted-index runtime,
  even when no predicate was actually pushed down. Asserting on
  profileText.contains("IndexFilter:") and on HitRows is therefore
  brittle. Switch the "index used" / "index not used" judgement to
  RowsInvertedIndexFiltered (rows the inverted index actually removed
  from the scan), which is exactly zero iff the index did not prune
  anything.
- The cast(v["int_key"] as double) = 13.0 case is folded by Nereids into
  CAST(v AS int) = 13 and goes through the existing INT equality path,
  not the new BE widening logic. Drop that case (and the related OR
  variant) since the assertion no longer matches the PR semantics.
- The cast(v["string_key"] as varchar(20)) case is wrapped by the FE as
  substring(CAST(CAST(v AS text) AS varchar(20)), 1, 20), which is
  outside the slot/cast-only contract this PR enables. Drop the case
  and document the limitation; the cast-to-text case continues to cover
  string-family widening.

To balance the deletions, add a real negative case
cast(v["int_key"] as bigint) = 5000000000 to verify that
convert_to_storage_value() rejects out-of-range literals at probe time
and falls back to full scan (RowsInvertedIndexFiltered == 0,
ScanRows == 20) while still returning the correct empty result.
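
The counter-based judgement can be sketched as below. The actual regression test is a Groovy suite; this C++ sketch only illustrates the rule, and it assumes a simplified profile format in which each counter appears as "Name: <digits>". The functions read_counter and index_pruned_rows are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical parser: return the integer after the first "name:" occurrence,
// or std::nullopt if the counter is absent or not followed by digits.
std::optional<std::int64_t> read_counter(const std::string& profile,
                                         const std::string& name) {
    const std::string key = name + ":";
    std::size_t pos = profile.find(key);
    if (pos == std::string::npos) {
        return std::nullopt;
    }
    pos += key.size();
    while (pos < profile.size() && profile[pos] == ' ') {
        ++pos;
    }
    std::size_t end = pos;
    while (end < profile.size() && profile[end] >= '0' && profile[end] <= '9') {
        ++end;
    }
    if (end == pos) {
        return std::nullopt;
    }
    return std::stoll(profile.substr(pos, end - pos));
}

// "Index used" means the index actually removed rows, not merely that an
// IndexFilter section or a HitRows counter is present in the profile.
bool index_pruned_rows(const std::string& profile) {
    return read_counter(profile, "RowsInvertedIndexFiltered").value_or(0) > 0;
}
```

A real profile may repeat the counter once per scanner or format large values differently, so a production check would likely need to aggregate all occurrences; this sketch reads only the first one.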

Also add four lightweight unit tests in be/test/exprs/vexpr_test.cpp
to pin down VExpr::expr_without_cast() behavior (no-cast / single-level
/ nested / non-slot leaf), since the variant pushdown paths fixed in
the previous commit rely on it to find the underlying slot beneath a
chain of FE-emitted casts.

### Release note

None

### Check List (For Author)

- Test: Regression test (this commit only adjusts test expectations and
  adds new test coverage) / Unit Test
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Cursor <cursoragent@cursor.com>
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request May 18, 2026
…ed indexes

### What problem does this PR solve?

Issue Number: None

Related PR: apache#63118

Problem Summary: Inverted index predicate pushdown did not support safe widening casts on indexed VARIANT subcolumns, so type-compatible cast predicates could not use the inverted index and scanned more rows than necessary. This change allows storage-compatible cast predicates to be pushed down and converts compatible query literals to the segment storage type before building inverted-index query values, keeping the query encoding consistent with the indexed storage representation.

### Release note

Support inverted index predicate pushdown for safe widening cast predicates on VARIANT subcolumns.

### Check List (For Author)

- Test: Added regression and unit test coverage
    - Regression test: test_variant_inverted_index_cast covers positive and negative VARIANT inverted-index cast pushdown cases.
    - Unit Test: vexpr_test and inverted_index_reader_test cover cast peeling and storage-value conversion.
- Behavior changed: Yes. Compatible VARIANT subcolumn cast predicates can now use inverted-index pushdown; unsupported or unsafe casts remain unpushed.
- Does this need documentation: No
@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch from 17f631e to 64bc776 on May 18, 2026 02:33
@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 52.38% (66/126) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.52% (20656/38592)
Line Coverage 37.17% (195192/525178)
Region Coverage 33.55% (152842/455500)
Branch Coverage 34.58% (66623/192690)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 34.13% (43/126) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 58.11% (21961/37794)
Line Coverage 41.54% (217578/523801)
Region Coverage 38.07% (175111/459912)
Branch Coverage 38.70% (74847/193416)

@hello-stephen
Contributor

TPC-H: Total hot run time: 31486 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3895a0878a68ec9a167fcb8e6039bb404ec91fd3, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17640	3886	3874	3874
q2	q3	10771	1473	847	847
q4	4683	475	347	347
q5	7652	2327	2157	2157
q6	240	178	138	138
q7	981	785	633	633
q8	9414	1823	1584	1584
q9	5474	4940	4937	4937
q10	6417	2085	1805	1805
q11	425	281	249	249
q12	632	428	298	298
q13	18135	3434	2815	2815
q14	263	256	234	234
q15	q16	821	776	713	713
q17	960	864	954	864
q18	7184	5717	5537	5537
q19	1369	1267	1148	1148
q20	672	459	297	297
q21	6098	2904	2698	2698
q22	525	381	311	311
Total cold run time: 100356 ms
Total hot run time: 31486 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4636	4547	4495	4495
q2	q3	4920	5260	4579	4579
q4	2144	2241	1434	1434
q5	5077	4597	4672	4597
q6	232	190	136	136
q7	1912	1744	1558	1558
q8	2355	2142	2085	2085
q9	7597	7260	7296	7260
q10	4502	4426	3986	3986
q11	529	403	353	353
q12	721	720	496	496
q13	3014	3426	2773	2773
q14	284	279	264	264
q15	q16	681	694	614	614
q17	1279	1235	1248	1235
q18	7231	6695	6799	6695
q19	1106	1074	1080	1074
q20	2231	2225	1945	1945
q21	5325	4675	4541	4541
q22	511	459	417	417
Total cold run time: 56287 ms
Total hot run time: 50537 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169788 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3895a0878a68ec9a167fcb8e6039bb404ec91fd3, data reload: false

query5	4343	661	544	544
query6	331	225	212	212
query7	4230	580	288	288
query8	329	235	230	230
query9	8825	3999	3986	3986
query10	460	344	294	294
query11	5779	2402	2273	2273
query12	206	125	123	123
query13	1300	589	417	417
query14	5889	5362	5042	5042
query14_1	4305	4334	4296	4296
query15	209	209	185	185
query16	1013	445	436	436
query17	940	721	591	591
query18	2438	492	351	351
query19	210	194	157	157
query20	134	130	130	130
query21	210	137	119	119
query22	13537	13541	13610	13541
query23	17211	16344	16006	16006
query23_1	16258	16189	16111	16111
query24	7441	1741	1312	1312
query24_1	1291	1297	1332	1297
query25	547	473	412	412
query26	1324	337	167	167
query27	2722	548	337	337
query28	4488	1971	1966	1966
query29	972	607	504	504
query30	310	242	202	202
query31	1105	1075	930	930
query32	91	77	71	71
query33	538	368	299	299
query34	1199	1119	659	659
query35	764	797	695	695
query36	1346	1315	1162	1162
query37	158	106	94	94
query38	3228	3136	3042	3042
query39	936	944	891	891
query39_1	902	875	890	875
query40	362	163	131	131
query41	73	70	70	70
query42	115	113	110	110
query43	332	330	289	289
query44	
query45	216	210	201	201
query46	1142	1191	719	719
query47	2361	2379	2224	2224
query48	412	429	318	318
query49	657	517	407	407
query50	957	348	257	257
query51	4375	4341	4194	4194
query52	108	111	97	97
query53	271	285	213	213
query54	331	294	288	288
query55	99	96	101	96
query56	339	323	320	320
query57	1432	1427	1323	1323
query58	314	292	282	282
query59	1559	1615	1499	1499
query60	337	339	323	323
query61	187	182	176	176
query62	685	653	559	559
query63	254	206	208	206
query64	2486	867	688	688
query65	
query66	1794	475	344	344
query67	30183	29937	29862	29862
query68	
query69	465	340	299	299
query70	1034	1003	983	983
query71	295	280	271	271
query72	2966	2698	2561	2561
query73	855	726	432	432
query74	5051	4947	4754	4754
query75	2674	2622	2251	2251
query76	2299	1138	765	765
query77	397	405	362	362
query78	12277	12236	11705	11705
query79	1437	1072	750	750
query80	661	547	456	456
query81	450	285	243	243
query82	1370	156	119	119
query83	354	272	247	247
query84	308	143	112	112
query85	910	552	451	451
query86	391	335	327	327
query87	3422	3393	3233	3233
query88	3535	2675	2662	2662
query89	433	394	341	341
query90	1985	182	184	182
query91	184	172	143	143
query92	78	80	75	75
query93	1573	1436	862	862
query94	545	367	281	281
query95	680	400	436	400
query96	1050	777	349	349
query97	2679	2698	2551	2551
query98	245	228	234	228
query99	1114	1116	996	996
Total cold run time: 253629 ms
Total hot run time: 169788 ms

@wuguowei1994 wuguowei1994 requested a review from eldenmoon May 19, 2026 15:46
@wuguowei1994 wuguowei1994 changed the title from "[fix](variant) Support safe widening cast pushdown for variant inverted indexes" to "[fix](variant) Enable inverted index pushdown for widening-cast predicates on variant subcolumns" Jun 1, 2026
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 2, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63118

Problem Summary: Remove redundant BE cast-chain peeling changes and broad regression coverage for cases already handled by existing FE rewrite rules. Keep this PR focused on the BE-side support still needed for safe widening casts that remain visible to inverted-index evaluation, such as auto-inferred BIGINT paths cast to LARGEINT and explicitly typed FLOAT paths compared through DOUBLE literals.
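
One way to read "safe widening" here is that every value of the source type is exactly representable in the target type, so comparing through the wider type cannot change which rows match. The sketch below expresses that property with `std::numeric_limits`. `is_lossless_widening` is a hypothetical helper, not a Doris function, and standard C++ types stand in for the BIGINT/LARGEINT and FLOAT/DOUBLE pairs mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper. Conservative: returns true only for same-kind conversions
// (integer to integer with equal signedness, or float to float) in which the
// target type can represent every value of the source type exactly.
template <typename From, typename To>
constexpr bool is_lossless_widening() {
    using F = std::numeric_limits<From>;
    using T = std::numeric_limits<To>;
    if constexpr (std::is_integral_v<From> && std::is_integral_v<To>) {
        // Same signedness and at least as many value bits.
        return F::is_signed == T::is_signed && T::digits >= F::digits;
    } else if constexpr (std::is_floating_point_v<From> && std::is_floating_point_v<To>) {
        // At least as much mantissa precision and exponent range.
        return T::digits >= F::digits && T::max_exponent >= F::max_exponent &&
               T::min_exponent <= F::min_exponent;
    } else {
        return false;
    }
}

static_assert(is_lossless_widening<int32_t, int64_t>());  // widening: safe
static_assert(is_lossless_widening<float, double>());     // widening: safe
static_assert(!is_lossless_widening<int64_t, int8_t>());  // narrowing: not safe
static_assert(!is_lossless_widening<double, float>());    // narrowing: not safe
```

The check is deliberately stricter than necessary (for example, it rejects unsigned-to-wider-signed conversions) so that a `true` result can be relied on without further range analysis.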

### Release note

None

### Check List (For Author)

- Test:
    - Static check: `git diff --cached --check`.
    - Manual test: previously compared master and this PR build on `10.106.128.180`.
    - Not run: full regression suite / BE UT after this cleanup.
- Behavior changed: No. This removes redundant code and narrows tests to the actual supported behavior.
- Does this need documentation: No
@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch from 832a4a0 to 6f6fc5b Compare June 2, 2026 10:27
@wuguowei1994 wuguowei1994 changed the title from "[fix](variant) Enable inverted index pushdown for widening-cast predicates on variant subcolumns" to "[fix](be) Fix variant inverted-index cast pushdown for int and boolean" Jun 2, 2026
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 2, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63118

Problem Summary: Fix a C++ compilation error where `VCastExpr` stored an `IndexExecContext` `shared_ptr` return value in an `auto*` variable.
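
This class of error comes from `auto*` deduction, which requires the initializer to be a raw pointer; a `std::shared_ptr` is a class type, so such a declaration is ill-formed. A minimal sketch with hypothetical stand-in names (`IndexExecContextStub`, `get_index_context`), not the actual Doris declarations:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the context type; not the real Doris class.
struct IndexExecContextStub {
    int probes = 0;
};

// Hypothetical accessor that hands out shared ownership of one context.
std::shared_ptr<IndexExecContextStub> get_index_context() {
    static auto ctx = std::make_shared<IndexExecContextStub>();
    return ctx;
}

int record_probe() {
    // auto* bad = get_index_context();  // does not compile: shared_ptr is not a raw pointer
    auto ctx = get_index_context();      // OK: deduces std::shared_ptr<IndexExecContextStub>
    auto* raw = ctx.get();               // OK: raw pointer, valid while ctx is alive
    raw->probes += 1;
    return ctx->probes;
}
```

Holding the result in a plain `auto` keeps the shared ownership; calling `.get()` is only appropriate when the owning pointer outlives the raw one.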

### Release note

None

### Check List (For Author)

- Test: Not run (per request; user will run formatting and tests)
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 2, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63118

Problem Summary: Add regression coverage for variant inverted-index cast pushdown when BIGINT storage values are outside the TINYINT range.

### Release note

None

### Check List (For Author)

- Test: Not run (per request)
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 2, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63118

Problem Summary: Fix variant inverted-index cast pushdown so that query literals are converted to the segment storage type before index probing, including positive IN predicates that narrow integer query types. This prevents column-level inverted-index evaluation from pruning valid rows for predicates such as `cast(variant_path as tinyint) IN (...)`, and adds boundary coverage for out-of-range narrowing values.
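
The conversion step can be pictured as a round-trip check: convert the query-typed literal to the storage type, convert it back, and probe the index only if the value is unchanged. The sketch below works under that assumption. `to_storage_literal` is a hypothetical helper, not the function added by this commit, and plain C++ arithmetic types stand in for Doris column types.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <type_traits>

// Hypothetical helper: convert a literal of the query type to the storage type.
// Returns std::nullopt when the value would not survive a round trip.
template <typename Storage, typename Query>
std::optional<Storage> to_storage_literal(Query literal) {
    static_assert(std::is_arithmetic_v<Storage> && std::is_arithmetic_v<Query>);
    // Keep the sketch to same-kind, same-signedness pairs so the comparison below is sound.
    static_assert(std::is_integral_v<Storage> == std::is_integral_v<Query>);
    static_assert(std::is_signed_v<Storage> == std::is_signed_v<Query>);

    if constexpr (std::is_floating_point_v<Storage>) {
        // Casting an out-of-range floating value is undefined behavior, so reject it first.
        if (literal > std::numeric_limits<Storage>::max() ||
            literal < std::numeric_limits<Storage>::lowest()) {
            return std::nullopt;
        }
    }
    const auto stored = static_cast<Storage>(literal);
    // Round-trip invariant: converting back must reproduce the original literal.
    if (static_cast<Query>(stored) != literal) return std::nullopt;
    return stored;
}
```

With these stand-in types, a TINYINT-like literal always widens to BIGINT-like storage, while `300` has no `int8_t` form and `0.1` has no exact `float` form, so both are reported as unconvertible rather than probed with a changed value.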

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh be/src/storage/predicate/in_list_predicate.h
    - build-support/check-format.sh did not complete because local clang-format is not version 16
    - BE unit test was not completed: run-be-ut.sh attempted submodule setup/download and was blocked by sandbox/network; elevated retry was interrupted
    - Regression test was not run because this worktree has no output/ Doris cluster
- Behavior changed: No
- Does this need documentation: No
@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch 2 times, most recently from 4f9064f to 47b3565 Compare June 2, 2026 16:07
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 2, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Fix the regression test expectation for tinyint overflow constants. These predicates are folded to an empty relation by FE, so no BE inverted-index profile counters are produced.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Verified from query profile that tinyint overflow predicates are planned as PhysicalEmptyRelation and do not produce inverted-index counters.
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 3, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Refine the variant inverted-index cast pushdown fix after review by clarifying integer cross-width conversion gates, documenting the round-trip invariant, reusing null-bitmap result construction, and extending conversion unit coverage.

### Release note

None

### Check List (For Author)

- Test: No need to test (not run per request)
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 3, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Add missing coverage for inverted-index cast pushdown conversion boundaries, including double/float round-trip behavior, int-to-decimal conversion, out-of-range integer skip, string compatibility, negative predicates, and large bigint variant values.

### Release note

None

### Check List (For Author)

- Test: No need to test (not run per request)
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 3, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Document why skipped literals are safe to ignore while building positive IN inverted-index results after storage-value conversion fails the round-trip check.
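
The reasoning can be sketched as follows, assuming that stored values are compared after a lossless widening to the query type: a literal that fails the round-trip check cannot equal any stored value, so it would contribute no rows to the union that a positive IN builds, and dropping it leaves the result unchanged. All names below are hypothetical, FLOAT storage with DOUBLE literals is used as the example pair, and a linear scan stands in for the inverted-index probe.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>
#include <set>
#include <vector>

// Hypothetical helper: narrow a DOUBLE literal to FLOAT storage, or nothing if it
// is out of range or would not round-trip.
std::optional<float> to_float_storage(double literal) {
    if (literal > std::numeric_limits<float>::max() ||
        literal < std::numeric_limits<float>::lowest()) {
        return std::nullopt;
    }
    const float stored = static_cast<float>(literal);
    if (static_cast<double>(stored) != literal) return std::nullopt;
    return stored;
}

// Row ids matching `cast(col as double) IN (literals)`, probing in the storage type.
std::set<std::size_t> rows_matching_in(const std::vector<float>& column,
                                       const std::vector<double>& literals) {
    std::set<std::size_t> matched;
    for (double literal : literals) {
        const auto stored = to_float_storage(literal);
        if (!stored) continue;  // no stored float widens to this double: adds no rows
        for (std::size_t row = 0; row < column.size(); ++row) {
            if (column[row] == *stored) matched.insert(row);
        }
    }
    return matched;
}

// Reference evaluation: widen each stored value and compare in the query type.
std::set<std::size_t> rows_matching_in_reference(const std::vector<float>& column,
                                                 const std::vector<double>& literals) {
    std::set<std::size_t> matched;
    for (std::size_t row = 0; row < column.size(); ++row) {
        for (double literal : literals) {
            if (static_cast<double>(column[row]) == literal) matched.insert(row);
        }
    }
    return matched;
}
```

The argument covers only the positive case; a negated predicate would need separate handling, because there a skipped literal excludes nothing rather than matching nothing.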

### Release note

None

### Check List (For Author)

- Test: No need to test (comment-only change)
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 3, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Add non-matching bigint variant rows so the large-value inverted-index profile assertion has rows to filter and remains stable.

### Release note

None

### Check List (For Author)

- Test: No need to test (not run per request)
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 3, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Keep the large bigint variant cast regression as a result-correctness case without requiring an unstable inverted-index profile counter for that key.

### Release note

None

### Check List (For Author)

- Test: No need to test (not run per request)
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 3, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Move the large bigint variant key into the main inserted batch so it is an extracted indexed subcolumn, then restore the profile assertion that verifies inverted-index filtering is used.

### Release note

None

### Check List (For Author)

- Test: No need to test (not run locally)
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 3, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Allow constant cast expressions on the literal side of comparison and IN predicates to continue through inverted-index evaluation. Large integer constants can be represented as cast literals, and the previous non-slot CAST_EXPR early return skipped index pushdown entirely.

### Release note

None

### Check List (For Author)

- Test: No need to test (not run locally)
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 3, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Revert the constant-cast inverted-index pushdown change because environment comparison showed the large bigint regression uses the inverted index on both old master and the PR build when the test data is shaped correctly. The prior source change was based on inference rather than a reproduced source defect.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Compared old master and PR build on the provided environment with the same large bigint variant query; both reported RowsInvertedIndexFiltered=19 and InvertedIndexQueryTime > 0.
- Behavior changed: No
- Does this need documentation: No
wuguowei1994 added a commit to wuguowei1994/doris that referenced this pull request Jun 3, 2026
### What problem does this PR solve?

Issue Number: close apache#63118

Related PR: apache#63118

Problem Summary: Keep the large bigint variant regression checking that inverted-index filtering is used, but do not require ScanRows to be exactly 1. Environment comparison showed the corrected query reports RowsInvertedIndexFiltered=19 and InvertedIndexQueryTime > 0 on both old master and the PR build, while ScanRows is 5.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Compared old master and PR build on the provided environment for the large bigint variant query; both used inverted-index filtering with RowsInvertedIndexFiltered=19.
- Behavior changed: No
- Does this need documentation: No
@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch from 8eef016 to f6ef2d9 Compare June 3, 2026 01:57
@wuguowei1994 wuguowei1994 force-pushed the fix-variant-inverted-index-cast branch from f6ef2d9 to 9afb062 Compare June 3, 2026 03:38
3 participants