Add datasketches HLL sketch aggregate functions by nooneuse · Pull Request #63143 · apache/doris

nooneuse · 2026-05-11T11:30:06Z

What problem does this PR solve?

An aggregate function is required to process user data containing Datasketches HLL sketches. In many data aggregation scenarios, users pre-aggregate detailed data in Hive using the sketching techniques provided by Apache Datasketches, and then analyze the resulting sketches across various OLAP engines. Compared with the HLL union aggregate functions natively offered by these engines, Datasketches HLL sketches differ in two key ways: first, the use cases differ; second, the sketches can be used seamlessly across different engines, for example simultaneously in ES, Doris, and ClickHouse. Such requirements are common in many production environments.
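
To illustrate why pre-aggregated sketches can be combined later, here is a minimal toy HyperLogLog in Python. It is only a sketch of the general idea under stated assumptions: it is not the Apache Datasketches algorithm or its binary format, and not the code added by this PR. The names `build`, `union`, and `estimate`, the precision `P = 12`, and the SHA-1 based hash are all hypothetical choices made for illustration.

```python
import hashlib
import math

P = 12          # assumed precision for this toy: 2**P registers
M = 1 << P


def _hash64(value: str) -> int:
    """Deterministic 64-bit hash taken from SHA-1 (illustrative choice only)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


def build(values):
    """Build a toy HLL register array from an iterable of strings."""
    regs = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                      # top P bits select the register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1  # position of the first 1-bit
        regs[idx] = max(regs[idx], rank)
    return regs


def union(a, b):
    """Merge two register arrays: the union keeps the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(regs):
    """Classic HLL estimate with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


# Two "pre-aggregated" partitions with overlapping users (made-up data).
day1 = build(f"user-{i}" for i in range(6000))
day2 = build(f"user-{i}" for i in range(4000, 10000))
merged = union(day1, day2)
print(round(estimate(merged)))  # approximate count of 10000 distinct users
```

Because the merge is a register-wise maximum, merging the two partitions gives exactly the same registers as sketching all the raw rows at once, which is what makes storing sketches instead of detail rows workable.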

Issue Number:

Summary:
Implemented a built-in aggregate function that integrates the Datasketches HLL sketch. This aggregate function cannot rely on the Java UDF environment: there, Strings are encoded in UTF-8, which corrupts the binary data of sketches, so the serialization/deserialization operations for sketches must be implemented on the BE side. Additionally, since Apache Datasketches has been added to the contrib directory via a git submodule, it will be easy to add other sketches, such as the theta sketch, in the future.
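
The corruption mentioned above can be reproduced in a few lines. The following Python sketch is an illustration only: the byte values are made up and do not represent a real Datasketches sketch, and `roundtrip_as_utf8_string` is a hypothetical helper. It mimics a layer that decodes a binary payload as UTF-8 text and encodes it again, replacing malformed sequences with U+FFFD, which is also what Java's `new String(bytes, StandardCharsets.UTF_8)` does.

```python
def roundtrip_as_utf8_string(raw: bytes) -> bytes:
    """Decode bytes as UTF-8 text and re-encode them.

    Malformed sequences are replaced with U+FFFD, so arbitrary binary
    payloads generally do not survive the round trip.
    """
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Made-up binary payload containing bytes that are not valid UTF-8.
sketch_like = bytes([0x02, 0x01, 0x07, 0x0C, 0xFF, 0x80, 0xC3, 0x28])
ascii_only = b"plain text"

corrupted = roundtrip_as_utf8_string(sketch_like)

print(corrupted != sketch_like)                            # binary payload changed
print(roundtrip_as_utf8_string(ascii_only) == ascii_only)  # ASCII text is unaffected
```

Once the invalid bytes have been replaced, the original payload cannot be recovered, which is consistent with the PR's decision to keep sketch bytes as binary and deserialize them on the BE side.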

see: #63142
use case: see regression test & #63142

Release note

Add Apache Datasketches Thirdparty submodule
Implemented an aggregate function that integrates the Datasketches HLL sketch.

Check List (For Author)

Test
- Regression test
- Unit Test
Behavior changed:
- No.
Does this need documentation?
- See: add docs for aggregation function datasketches_hll_union_agg doris-website#3711

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-05-11T11:30:11Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

BePPPower · 2026-05-11T11:50:24Z

run buildall

nooneuse · 2026-05-11T12:31:27Z

run buildall

nooneuse · 2026-05-11T13:43:54Z

compile

hello-stephen · 2026-05-11T14:56:58Z

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-05-11T16:20:20Z

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	78.05% (1849/2369)
Line Coverage	64.73% (33222/51327)
Region Coverage	65.25% (16441/25198)
Branch Coverage	55.81% (8780/15732)

hello-stephen · 2026-05-11T17:07:12Z

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

nooneuse · 2026-05-12T04:08:03Z

run buildall

nooneuse · 2026-05-12T06:43:43Z

run buildall

hello-stephen · 2026-05-12T08:07:25Z

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	78.05% (1849/2369)
Line Coverage	64.73% (33225/51327)
Region Coverage	65.24% (16439/25198)
Branch Coverage	55.80% (8779/15732)

hello-stephen · 2026-05-12T08:37:50Z

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

nooneuse · 2026-05-26T13:11:56Z

run buildall

nooneuse · 2026-05-26T13:53:59Z

run buildall

nooneuse · 2026-05-26T13:55:38Z

run buildall

nooneuse · 2026-05-27T02:44:42Z

run buildall

nooneuse · 2026-05-27T05:52:03Z

run buildall

nooneuse · 2026-05-27T05:57:08Z

Hi @zclllyybb @linrrzqqq, sorry to bother you again. I have addressed the bot's review comments once more. When you have time, could you please help trigger /review?

hello-stephen · 2026-05-27T06:16:44Z

TPC-H: Total hot run time: 31952 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 253ab4748167c417873cd71226760a8c1ddd69a4, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17647	4141	4020	4020
q2
q3	10859	1382	812	812
q4	4686	479	350	350
q5	7575	2280	2105	2105
q6	243	178	139	139
q7	949	787	637	637
q8	9451	1717	1638	1638
q9	5318	4999	4951	4951
q10	6385	2188	1882	1882
q11	454	279	246	246
q12	636	438	297	297
q13	18103	3423	2789	2789
q14	266	260	239	239
q15
q16	831	765	714	714
q17	991	987	972	972
q18	7299	5813	5643	5643
q19	1192	1266	1218	1218
q20	559	472	303	303
q21	5883	2862	2616	2616
q22	457	437	381	381
Total cold run time: 99784 ms
Total hot run time: 31952 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4808	4931	4742	4742
q2
q3	4876	5222	4723	4723
q4	2081	2173	1386	1386
q5	4717	4890	4773	4773
q6	236	177	130	130
q7	1847	1746	1592	1592
q8	2450	2098	2103	2098
q9	8063	7921	7386	7386
q10	4719	4671	4231	4231
q11	539	377	354	354
q12	801	729	527	527
q13	2981	3437	2779	2779
q14	276	278	258	258
q15
q16	678	702	610	610
q17	1269	1256	1249	1249
q18	7341	6710	6690	6690
q19	1141	1107	1105	1105
q20	2224	2208	1950	1950
q21	5260	4570	4432	4432
q22	542	482	393	393
Total cold run time: 56849 ms
Total hot run time: 51408 ms
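
The report does not label its columns. Judging from the numbers, each row appears to hold one cold run, two hot runs, and the smaller of the two hot runs, with the totals summing the first and last columns. The Python check below tests that reading against the Round 1 rows; the column interpretation and the names `ROUND1` and `totals` are assumptions, and q2 and q15 are omitted because the report lists no timings for them.

```python
# (query, cold run, hot run 1, hot run 2) in ms, copied from Round 1 above.
ROUND1 = [
    ("q1", 17647, 4141, 4020),
    ("q3", 10859, 1382, 812),
    ("q4", 4686, 479, 350),
    ("q5", 7575, 2280, 2105),
    ("q6", 243, 178, 139),
    ("q7", 949, 787, 637),
    ("q8", 9451, 1717, 1638),
    ("q9", 5318, 4999, 4951),
    ("q10", 6385, 2188, 1882),
    ("q11", 454, 279, 246),
    ("q12", 636, 438, 297),
    ("q13", 18103, 3423, 2789),
    ("q14", 266, 260, 239),
    ("q16", 831, 765, 714),
    ("q17", 991, 987, 972),
    ("q18", 7299, 5813, 5643),
    ("q19", 1192, 1266, 1218),
    ("q20", 559, 472, 303),
    ("q21", 5883, 2862, 2616),
    ("q22", 457, 437, 381),
]


def totals(rows):
    """Return (total cold, total hot), taking the best of the two hot runs."""
    cold = sum(r[1] for r in rows)
    hot = sum(min(r[2], r[3]) for r in rows)
    return cold, hot


print(totals(ROUND1))  # (99784, 31952), matching the Round 1 totals reported above
```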

hello-stephen · 2026-05-27T06:27:42Z

TPC-DS: Total hot run time: 171029 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 253ab4748167c417873cd71226760a8c1ddd69a4, data reload: false

query5	4313	651	520	520
query6	328	233	208	208
query7	4223	558	324	324
query8	357	242	221	221
query9	8803	4019	3996	3996
query10	446	344	285	285
query11	5770	2614	2237	2237
query12	182	124	123	123
query13	1304	634	440	440
query14	6154	5442	5202	5202
query14_1	4474	4495	4472	4472
query15	220	210	184	184
query16	1006	441	436	436
query17	1056	723	591	591
query18	2443	476	353	353
query19	208	201	156	156
query20	136	135	130	130
query21	210	135	118	118
query22	13707	13544	13388	13388
query23	17469	16606	16254	16254
query23_1	16339	16447	16434	16434
query24	7434	1760	1290	1290
query24_1	1348	1293	1330	1293
query25	554	487	417	417
query26	1308	321	173	173
query27	2695	577	353	353
query28	4447	2013	1994	1994
query29	967	611	498	498
query30	295	242	202	202
query31	1139	1076	940	940
query32	93	79	75	75
query33	528	351	289	289
query34	1167	1147	661	661
query35	770	795	705	705
query36	1455	1420	1281	1281
query37	157	109	96	96
query38	3213	3185	3065	3065
query39	927	917	885	885
query39_1	883	871	902	871
query40	243	151	131	131
query41	72	68	69	68
query42	113	113	110	110
query43	333	338	293	293
query44	
query45	219	210	201	201
query46	1120	1196	757	757
query47	2364	2426	2315	2315
query48	412	409	307	307
query49	660	529	415	415
query50	1023	357	255	255
query51	4348	4286	4401	4286
query52	106	105	101	101
query53	257	284	213	213
query54	342	290	269	269
query55	96	94	89	89
query56	326	345	316	316
query57	1459	1430	1342	1342
query58	298	277	274	274
query59	1571	1708	1436	1436
query60	320	315	302	302
query61	159	155	159	155
query62	703	650	576	576
query63	249	199	214	199
query64	2397	807	629	629
query65	
query66	1719	481	361	361
query67	29910	29818	29668	29668
query68	
query69	469	340	303	303
query70	1028	1015	918	918
query71	304	273	268	268
query72	2914	2622	2413	2413
query73	851	799	411	411
query74	5122	4971	4793	4793
query75	2711	2618	2277	2277
query76	2291	1136	761	761
query77	406	400	333	333
query78	12583	12548	11977	11977
query79	1447	1043	772	772
query80	645	540	457	457
query81	456	281	245	245
query82	1371	153	124	124
query83	362	284	248	248
query84	285	140	110	110
query85	889	532	475	475
query86	411	354	351	351
query87	3452	3378	3278	3278
query88	3621	2758	2738	2738
query89	438	389	349	349
query90	1897	182	184	182
query91	179	172	142	142
query92	79	80	74	74
query93	1426	1443	834	834
query94	558	366	313	313
query95	679	383	443	383
query96	1046	847	345	345
query97	2731	2728	2613	2613
query98	240	232	226	226
query99	1189	1139	1022	1022
Total cold run time: 254265 ms
Total hot run time: 171029 ms

hello-stephen · 2026-05-27T08:12:57Z

BE Regression && UT Coverage Report

Increment line coverage 84.18% (133/158) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.83% (28093/38050)
Line Coverage	57.72% (304801/528039)
Region Coverage	54.85% (254972/464815)
Branch Coverage	56.40% (110172/195357)

hello-stephen · 2026-05-27T08:19:53Z

FE Regression Coverage Report

Increment line coverage 72.00% (18/25) 🎉
Increment coverage report
Complete coverage report

nooneuse · 2026-05-27T08:35:53Z

Hi, excuse me @linrrzqqq @zclllyybb , could you please help trigger /review? I have already discussed/fixed all the current comments again~

nooneuse · 2026-05-28T02:27:19Z

Hello @linrrzqqq @zclllyybb , could you please help trigger /review? I have already discussed/fixed all the current comments again~ thanks!

nooneuse · 2026-05-28T07:37:44Z

Hi @linrrzqqq , could you please help trigger /review?

linrrzqqq · 2026-05-29T02:46:50Z

/review

github-actions

Automated review summary for head 253ab47.

No new blocking findings after reviewing the current diff and the existing inline review threads. I did not re-raise the already-known items around submodule fallback URL behavior, input type coercion, TYPE_BINARY registration, memory tracking, exception-code preservation, empty string handling, empty-state merge, low-lg_k sketches, aliases, and regression cleanup.

Critical checkpoint conclusions:

Goal and tests: the PR adds a DataSketches HLL union estimate aggregate with BE/FE registration, aliases, BE unit tests, and regression coverage for string, varchar, varbinary, aliases, distinct, empty input, and corrupt input. The current code appears to implement that goal.
Scope: the feature is focused on one aggregate function plus the required build/submodule wiring and tests.
Concurrency: no new shared mutable concurrent state or lock ordering was introduced; aggregate state is per group/place.
Lifecycle/static initialization: no cross-translation-unit static initialization dependency was found beyond existing factory registration patterns.
Configuration: no new Doris configuration items were added.
Compatibility: no storage aggregation type is exposed in the final diff; intermediate aggregate state remains serialized through existing aggregate-function paths.
Parallel paths: FE and BE names, aliases, supported input types, return type, and visitor registration are present on both sides.
Special conditions: null/empty/corrupt input and empty aggregate states are handled deliberately, with existing comments/tests for non-obvious cases.
Test coverage: BE unit tests and regression tests cover the main behavior and prior review edge cases. I did not run the full BE/FE/regression test suites in this runner.
Test results: the added .out file matches the deterministic regression query outputs shown in the suite.
Observability: no additional logs or metrics appear necessary for this aggregate path.
Transaction/persistence/data writes: not applicable; this is query-time aggregate evaluation and build wiring.
FE-BE variables/protocol: no new thrift field or runtime variable passing was introduced.
Performance/memory: the aggregate now uses the Doris custom allocator for DataSketches-owned buffers, and I did not find an additional obvious hot-path regression.

User focus: no additional user-provided review focus was specified.

nooneuse · 2026-05-29T03:48:19Z

Hello @zclllyybb @linrrzqqq! Thanks for your comments! As it stands, the bot found no new issues in its review, and I have discussed or fixed all known issues. Could you please help review the code when you have time, so that we can continue to move the PR forward?

linrrzqqq

LGTM

fixed

### What problem does this PR solve?

> An aggregate function is required to process user data containing Datasketches HLL sketches. In many data aggregation scenarios, users pre-aggregate detailed data in Hive using the sketching techniques provided by Apache Datasketches, and then analyze the resulting sketches across various OLAP engines. Compared with the HLL union aggregate functions natively offered by these engines, Datasketches HLL sketches differ in two key ways: first, the use cases differ; second, the sketches can be used seamlessly across different engines, for example simultaneously in ES, Doris, and ClickHouse. Such requirements are common in many production environments.

Issue Number:
- #63142
- #26416
- #56246

Summary:
Implemented a built-in aggregate function that integrates the Datasketches HLL sketch. This aggregate function cannot rely on the Java UDF environment: there, Strings are encoded in UTF-8, which corrupts the binary data of sketches, so the serialization/deserialization operations for sketches must be implemented on the BE side. Additionally, since Apache Datasketches has been added to the contrib directory via a git submodule, it will be easy to add other sketches, such as the theta sketch, in the future.

**see**: #63142
**use case**: see regression test & #63142

---------

Co-authored-by: yuanyuhao <yuanyuhao@bytedance.com>

…63911) Cherry-picked from #63143 Co-authored-by: nooneuse <nooneuse@users.noreply.github.com> Co-authored-by: yuanyuhao <yuanyuhao@bytedance.com>

yuanyuhao and others added 4 commits May 7, 2026 22:00

add datasketches HLL sketches union aggregate functions for doris

a70254f

fix typo & compile

8a277f9

add be unit test

89dae1b

fix corner case & add regression test

216d172

nooneuse requested review from BiteTheDDDDt, CalvinKirs, morningman and zclllyybb as code owners May 11, 2026 11:30

revert vcs.xml. This file modification is unnecessary

2fd6967

nooneuse commented May 11, 2026

View reviewed changes

Comment thread fe/pom.xml

nooneuse mentioned this pull request May 11, 2026

[Feature] Add Apache Datasketches HLL sketches aggregate function #63142

Open

3 tasks

change regression test to groovy-out style

cdf2f9d

Merge branch 'master' into add_datasketches_union_aggregate_functions

894be1a

nooneuse and others added 4 commits May 12, 2026 11:21

Merge branch 'master' into add_datasketches_union_aggregate_functions

201ae9c

reformat be codes

8d36961

reformat be codes (part2)

fd97b44

reformat imports lines

03e739c

fix submodule build command

99f59fe

fix be ut build script

09b8b25

Merge branch 'master' into add_datasketches_union_aggregate_functions

bcd64dc

Merge branch 'master' into add_datasketches_union_aggregate_functions

edfc10d

nooneuse added 2 commits May 27, 2026 13:44

reformat fe codes

24511ef

fix fe codes

253ab47

github-actions Bot reviewed May 29, 2026

View reviewed changes

linrrzqqq approved these changes May 30, 2026

View reviewed changes

zclllyybb approved these changes May 30, 2026

View reviewed changes

zclllyybb added the dev/4.1.x label May 30, 2026

zclllyybb merged commit 1b44c05 into apache:master May 30, 2026
31 of 32 checks passed

github-actions Bot mentioned this pull request May 30, 2026

branch-4.1: Add datasketches HLL sketch aggregate functions #63143 #63911

Merged

yiguolei added dev/4.1.2-merged and removed dev/4.1.x labels Jun 1, 2026
