Add datasketches HLL sketch aggregate functions #63143

Merged
zclllyybb merged 69 commits into apache:master from nooneuse:add_datasketches_union_aggregate_functions
May 30, 2026

@nooneuse
Contributor

@nooneuse nooneuse commented May 11, 2026

What problem does this PR solve?

An aggregate function is required to process user data containing Datasketches HLL sketches. In many data aggregation scenarios, users pre‑aggregate detailed data in Hive using the sketching techniques provided by Apache Datasketches, and then analyze the resulting sketches across various OLAP engines. Compared with the HLL union aggregate functions natively offered by these engines, Datasketches HLL sketches differ in two key ways: first, the use cases differ; second, the sketches can be used seamlessly across different engines, for example simultaneously in ES, Doris, and ClickHouse. Such requirements are common in many production environments.
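
To illustrate why pre-aggregated sketches can be merged later in another engine, here is a toy HyperLogLog in Python. It is only a sketch of the general idea: it is not the DataSketches algorithm or its binary format, and every name in it (`build_sketch`, `union`, `estimate`, `LG_K`) is hypothetical. The point it shows is that the union of two sketches is a register-wise maximum, which gives the same registers as sketching the combined raw data.

```python
import hashlib
import math

LG_K = 10          # demo value: 2**10 = 1024 registers
K = 1 << LG_K
REST_BITS = 64 - LG_K

def _hash64(item: str) -> int:
    """Deterministic 64-bit hash taken from the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def build_sketch(items):
    """Pre-aggregate raw items into a list of K registers."""
    registers = [0] * K
    for item in items:
        h = _hash64(item)
        index = h >> REST_BITS                      # top LG_K bits pick the register
        rest = h & ((1 << REST_BITS) - 1)           # remaining bits
        rank = REST_BITS - rest.bit_length() + 1    # position of the first 1 bit
        registers[index] = max(registers[index], rank)
    return registers

def union(a, b):
    """Merge two sketches: keep the larger value of each register pair."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers):
    """Classic HLL estimate with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / K)
    raw = alpha * K * K / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * K and zeros:
        return K * math.log(K / zeros)
    return raw

# Two partitions that overlap on user-5000 .. user-9999 (15,000 distinct users in total).
partition_a = [f"user-{i}" for i in range(0, 10_000)]
partition_b = [f"user-{i}" for i in range(5_000, 15_000)]

sketch_a = build_sketch(partition_a)
sketch_b = build_sketch(partition_b)
merged = union(sketch_a, sketch_b)

print(round(estimate(merged)))
```

Because the merge only needs the registers, the raw rows never have to leave the system that produced the sketches. Sharing sketches across engines additionally requires that every engine reads the same serialized format, which is what the DataSketches library provides and this toy does not.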

Issue Number:

Summary:
Implemented a built-in aggregate function that integrates the Datasketches HLL sketch. This aggregate function cannot rely on the Java UDF environment: Strings there are encoded in UTF-8, which corrupts the binary data of sketches, so the serialization/deserialization of sketches must be implemented on the BE side. Additionally, since Apache Datasketches has been added to the contrib directory via a git submodule, it will become very easy to add other sketches, such as the theta sketch, in the future.

see: #63142
use case: see regression test & #63142
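
The UTF-8 concern in the summary can be reproduced in a few lines. The snippet below is an analogy in Python, not the Doris Java UDF code path, and the byte values are made up rather than a real sketch image. It assumes the binary payload gets interpreted as UTF-8 text at some point and is then converted back to bytes.

```python
# Made-up binary payload; NOT a real DataSketches HLL image.
sketch_bytes = bytes([0x02, 0x01, 0x07, 0x0C, 0xFF, 0x80, 0xC3, 0x28])

def round_trip_as_utf8_text(raw: bytes) -> bytes:
    """Interpret binary data as UTF-8 text, then convert it back to bytes."""
    # Byte sequences that are not valid UTF-8 are replaced with U+FFFD.
    # Java's new String(bytes, StandardCharsets.UTF_8) substitutes the
    # replacement character for malformed input in the same way.
    text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")

damaged = round_trip_as_utf8_text(sketch_bytes)

print(damaged == sketch_bytes)          # the payload no longer matches
print(len(sketch_bytes), len(damaged))  # the length changes as well
```

Pure ASCII input survives this round trip unchanged, so the problem only appears once the payload contains bytes that are not valid UTF-8, which is normal for a serialized sketch. Keeping the value as raw bytes end to end avoids the issue.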

Release note

  1. Add Apache Datasketches as a third-party submodule
  2. Implemented an aggregate function that integrates the Datasketches HLL sketch.

Check List (For Author)

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Comment thread fe/pom.xml
@BePPPower
Contributor

run buildall

@nooneuse
Contributor Author

run buildall

@nooneuse
Contributor Author

compile

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.05% (1849/2369)
Line Coverage 64.73% (33222/51327)
Region Coverage 65.25% (16441/25198)
Branch Coverage 55.81% (8780/15732)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Contributor Author

run buildall

@nooneuse
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.05% (1849/2369)
Line Coverage 64.73% (33225/51327)
Region Coverage 65.24% (16439/25198)
Branch Coverage 55.80% (8779/15732)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Contributor Author

run buildall

@nooneuse
Contributor Author

run buildall

@nooneuse
Contributor Author

run buildall

@nooneuse
Contributor Author

run buildall

@nooneuse
Contributor Author

run buildall

@nooneuse
Contributor Author

Hi, @zclllyybb @linrrzqqq Sorry to bother you again. I have finished making the bot's review changes once more. When you have time, could you please help trigger /review?

@hello-stephen
Contributor

TPC-H: Total hot run time: 31952 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 253ab4748167c417873cd71226760a8c1ddd69a4, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17647	4141	4020	4020
q2	q3	10859	1382	812	812
q4	4686	479	350	350
q5	7575	2280	2105	2105
q6	243	178	139	139
q7	949	787	637	637
q8	9451	1717	1638	1638
q9	5318	4999	4951	4951
q10	6385	2188	1882	1882
q11	454	279	246	246
q12	636	438	297	297
q13	18103	3423	2789	2789
q14	266	260	239	239
q15	q16	831	765	714	714
q17	991	987	972	972
q18	7299	5813	5643	5643
q19	1192	1266	1218	1218
q20	559	472	303	303
q21	5883	2862	2616	2616
q22	457	437	381	381
Total cold run time: 99784 ms
Total hot run time: 31952 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4808	4931	4742	4742
q2	q3	4876	5222	4723	4723
q4	2081	2173	1386	1386
q5	4717	4890	4773	4773
q6	236	177	130	130
q7	1847	1746	1592	1592
q8	2450	2098	2103	2098
q9	8063	7921	7386	7386
q10	4719	4671	4231	4231
q11	539	377	354	354
q12	801	729	527	527
q13	2981	3437	2779	2779
q14	276	278	258	258
q15	q16	678	702	610	610
q17	1269	1256	1249	1249
q18	7341	6710	6690	6690
q19	1141	1107	1105	1105
q20	2224	2208	1950	1950
q21	5260	4570	4432	4432
q22	542	482	393	393
Total cold run time: 56849 ms
Total hot run time: 51408 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171029 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 253ab4748167c417873cd71226760a8c1ddd69a4, data reload: false

query5	4313	651	520	520
query6	328	233	208	208
query7	4223	558	324	324
query8	357	242	221	221
query9	8803	4019	3996	3996
query10	446	344	285	285
query11	5770	2614	2237	2237
query12	182	124	123	123
query13	1304	634	440	440
query14	6154	5442	5202	5202
query14_1	4474	4495	4472	4472
query15	220	210	184	184
query16	1006	441	436	436
query17	1056	723	591	591
query18	2443	476	353	353
query19	208	201	156	156
query20	136	135	130	130
query21	210	135	118	118
query22	13707	13544	13388	13388
query23	17469	16606	16254	16254
query23_1	16339	16447	16434	16434
query24	7434	1760	1290	1290
query24_1	1348	1293	1330	1293
query25	554	487	417	417
query26	1308	321	173	173
query27	2695	577	353	353
query28	4447	2013	1994	1994
query29	967	611	498	498
query30	295	242	202	202
query31	1139	1076	940	940
query32	93	79	75	75
query33	528	351	289	289
query34	1167	1147	661	661
query35	770	795	705	705
query36	1455	1420	1281	1281
query37	157	109	96	96
query38	3213	3185	3065	3065
query39	927	917	885	885
query39_1	883	871	902	871
query40	243	151	131	131
query41	72	68	69	68
query42	113	113	110	110
query43	333	338	293	293
query44	
query45	219	210	201	201
query46	1120	1196	757	757
query47	2364	2426	2315	2315
query48	412	409	307	307
query49	660	529	415	415
query50	1023	357	255	255
query51	4348	4286	4401	4286
query52	106	105	101	101
query53	257	284	213	213
query54	342	290	269	269
query55	96	94	89	89
query56	326	345	316	316
query57	1459	1430	1342	1342
query58	298	277	274	274
query59	1571	1708	1436	1436
query60	320	315	302	302
query61	159	155	159	155
query62	703	650	576	576
query63	249	199	214	199
query64	2397	807	629	629
query65	
query66	1719	481	361	361
query67	29910	29818	29668	29668
query68	
query69	469	340	303	303
query70	1028	1015	918	918
query71	304	273	268	268
query72	2914	2622	2413	2413
query73	851	799	411	411
query74	5122	4971	4793	4793
query75	2711	2618	2277	2277
query76	2291	1136	761	761
query77	406	400	333	333
query78	12583	12548	11977	11977
query79	1447	1043	772	772
query80	645	540	457	457
query81	456	281	245	245
query82	1371	153	124	124
query83	362	284	248	248
query84	285	140	110	110
query85	889	532	475	475
query86	411	354	351	351
query87	3452	3378	3278	3278
query88	3621	2758	2738	2738
query89	438	389	349	349
query90	1897	182	184	182
query91	179	172	142	142
query92	79	80	74	74
query93	1426	1443	834	834
query94	558	366	313	313
query95	679	383	443	383
query96	1046	847	345	345
query97	2731	2728	2613	2613
query98	240	232	226	226
query99	1189	1139	1022	1022
Total cold run time: 254265 ms
Total hot run time: 171029 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 84.18% (133/158) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.83% (28093/38050)
Line Coverage 57.72% (304801/528039)
Region Coverage 54.85% (254972/464815)
Branch Coverage 56.40% (110172/195357)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 72.00% (18/25) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Contributor Author

nooneuse commented May 27, 2026

Hi, excuse me @linrrzqqq @zclllyybb , could you please help trigger /review? I have already discussed/fixed all the current comments again~

@nooneuse
Contributor Author

Hello @linrrzqqq @zclllyybb , could you please help trigger /review? I have already discussed/fixed all the current comments again~ thanks!

@nooneuse
Contributor Author

Hi @linrrzqqq , could you please help trigger /review?

@linrrzqqq
Collaborator

/review

Contributor

@github-actions github-actions Bot left a comment

Automated review summary for head 253ab47.

No new blocking findings after reviewing the current diff and the existing inline review threads. I did not re-raise the already-known items around submodule fallback URL behavior, input type coercion, TYPE_BINARY registration, memory tracking, exception-code preservation, empty string handling, empty-state merge, low-lg_k sketches, aliases, and regression cleanup.

Critical checkpoint conclusions:

  • Goal and tests: the PR adds a DataSketches HLL union estimate aggregate with BE/FE registration, aliases, BE unit tests, and regression coverage for string, varchar, varbinary, aliases, distinct, empty input, and corrupt input. The current code appears to implement that goal.
  • Scope: the feature is focused on one aggregate function plus the required build/submodule wiring and tests.
  • Concurrency: no new shared mutable concurrent state or lock ordering was introduced; aggregate state is per group/place.
  • Lifecycle/static initialization: no cross-translation-unit static initialization dependency was found beyond existing factory registration patterns.
  • Configuration: no new Doris configuration items were added.
  • Compatibility: no storage aggregation type is exposed in the final diff; intermediate aggregate state remains serialized through existing aggregate-function paths.
  • Parallel paths: FE and BE names, aliases, supported input types, return type, and visitor registration are present on both sides.
  • Special conditions: null/empty/corrupt input and empty aggregate states are handled deliberately, with existing comments/tests for non-obvious cases.
  • Test coverage: BE unit tests and regression tests cover the main behavior and prior review edge cases. I did not run the full BE/FE/regression test suites in this runner.
  • Test results: the added .out file matches the deterministic regression query outputs shown in the suite.
  • Observability: no additional logs or metrics appear necessary for this aggregate path.
  • Transaction/persistence/data writes: not applicable; this is query-time aggregate evaluation and build wiring.
  • FE-BE variables/protocol: no new thrift field or runtime variable passing was introduced.
  • Performance/memory: the aggregate now uses the Doris custom allocator for DataSketches-owned buffers, and I did not find an additional obvious hot-path regression.

User focus: no additional user-provided review focus was specified.

@nooneuse
Contributor Author

nooneuse commented May 29, 2026

Hello @zclllyybb @linrrzqqq ! Thanks for your comments! As it stands, the bot didn't find new issues after its review, and all known issues have been discussed/fixed by me. Could you please help review the code when you have time, so that we can continue to move the PR forward?

Collaborator

@linrrzqqq linrrzqqq left a comment

LGTM

@zclllyybb zclllyybb merged commit 1b44c05 into apache:master May 30, 2026
31 of 32 checks passed
github-actions Bot pushed a commit that referenced this pull request May 30, 2026
### What problem does this PR solve?

> An aggregate function is required to process user data containing
Datasketches HLL sketches. In many data aggregation scenarios, users
pre‑aggregate detailed data in Hive using the sketching techniques
provided by Apache Datasketches, and then analyze the resulting sketches
across various OLAP engines. Compared with the HLL union aggregate
functions natively offered by these engines, Datasketches HLL sketches
differ in two key ways: first, the use cases differ; second, the
sketches can be used seamlessly across different engines, for example
simultaneously in ES, Doris, and ClickHouse. Such requirements are
common in many production environments.

Issue Number: 
- #63142
- #26416
- #56246

Summary:
Implemented a built-in aggregate function that integrates the
Datasketches HLL sketch. This aggregate function cannot rely on the Java
UDF environment: Strings there are encoded in UTF-8, which corrupts the
binary data of sketches, so the serialization/deserialization of
sketches must be implemented on the BE side. Additionally, since Apache
Datasketches has been added to the contrib directory via a git
submodule, it will become very easy to add other sketches, such as the
theta sketch, in the future.

**see**: #63142
**use case**: see regression test &
#63142

---------

Co-authored-by: yuanyuhao <yuanyuhao@bytedance.com>
yiguolei pushed a commit that referenced this pull request Jun 1, 2026
…63911)

Cherry-picked from #63143

Co-authored-by: nooneuse <nooneuse@users.noreply.github.com>
Co-authored-by: yuanyuhao <yuanyuhao@bytedance.com>

8 participants