Add datasketches HLL sketch aggregate functions #63143
run buildall
Hi @zclllyybb @linrrzqqq, sorry to bother you again. I have addressed the bot's review comments once more. When you have time, could you please trigger /review?
TPC-H: Total hot run time: 31952 ms
TPC-DS: Total hot run time: 171029 ms
Hi @linrrzqqq @zclllyybb, could you please trigger /review? I have discussed or fixed all of the current comments again.
Hello @linrrzqqq @zclllyybb, could you please trigger /review? I have discussed or fixed all of the current comments again. Thanks!
Hi @linrrzqqq, could you please trigger /review?
/review
Automated review summary for head 253ab47.
No new blocking findings after reviewing the current diff and the existing inline review threads. I did not re-raise the already-known items around submodule fallback URL behavior, input type coercion, TYPE_BINARY registration, memory tracking, exception-code preservation, empty string handling, empty-state merge, low-lg_k sketches, aliases, and regression cleanup.
Critical checkpoint conclusions:
- Goal and tests: the PR adds a DataSketches HLL union estimate aggregate with BE/FE registration, aliases, BE unit tests, and regression coverage for string, varchar, varbinary, aliases, distinct, empty input, and corrupt input. The current code appears to implement that goal.
- Scope: the feature is focused on one aggregate function plus the required build/submodule wiring and tests.
- Concurrency: no new shared mutable concurrent state or lock ordering was introduced; aggregate state is per group/place.
- Lifecycle/static initialization: no cross-translation-unit static initialization dependency was found beyond existing factory registration patterns.
- Configuration: no new Doris configuration items were added.
- Compatibility: no storage aggregation type is exposed in the final diff; intermediate aggregate state remains serialized through existing aggregate-function paths.
- Parallel paths: FE and BE names, aliases, supported input types, return type, and visitor registration are present on both sides.
- Special conditions: null/empty/corrupt input and empty aggregate states are handled deliberately, with existing comments/tests for non-obvious cases.
- Test coverage: BE unit tests and regression tests cover the main behavior and prior review edge cases. I did not run the full BE/FE/regression test suites in this runner.
- Test results: the added .out file matches the deterministic regression query outputs shown in the suite.
- Observability: no additional logs or metrics appear necessary for this aggregate path.
- Transaction/persistence/data writes: not applicable; this is query-time aggregate evaluation and build wiring.
- FE-BE variables/protocol: no new thrift field or runtime variable passing was introduced.
- Performance/memory: the aggregate now uses the Doris custom allocator for DataSketches-owned buffers, and I did not find an additional obvious hot-path regression.
User focus: no additional user-provided review focus was specified.
Hello @zclllyybb @linrrzqqq! Thanks for your comments! As it stands, the bot found no new issues in its latest review, and I have discussed or fixed all known issues. Could you please review the code when you have time, so that we can keep the PR moving forward?
### What problem does this PR solve?

> An aggregate function is required to process user data containing DataSketches HLL sketches. In many data aggregation scenarios, users pre-aggregate detailed data in Hive using the sketching techniques provided by Apache DataSketches, and then analyze the resulting sketches across various OLAP engines. Compared with the HLL union aggregate functions natively offered by these engines, DataSketches HLL sketches differ in two key ways: first, the use cases differ; second, the sketches can be used seamlessly across engines, for example simultaneously in ES, Doris, and ClickHouse. Such requirements are common in many production environments.

Issue Number:
- #63142
- #26416
- #56246

Summary:

Implemented a built-in aggregate function that integrates the DataSketches HLL sketch. This aggregate function cannot rely on the Java UDF environment: there, Strings are encoded as UTF-8, which corrupts the binary data of sketches, so sketch serialization and deserialization must be implemented on the BE side. (Additionally, since Apache DataSketches has been added to the contrib directory as a git submodule, it will become much easier to add other sketches, such as the theta sketch, in the future.)

**see**: #63142

**use case**: see regression test & #63142

---------

Co-authored-by: yuanyuhao <yuanyuhao@bytedance.com>
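The UTF-8 corruption mentioned in the summary can be reproduced outside Doris. Below is a minimal sketch in Python, not code from this PR. It assumes the string layer substitutes U+FFFD for malformed byte sequences, which is how a Java `String` built from bytes with a UTF-8 charset is expected to behave. The payload bytes and the helper name `roundtrip_through_utf8_string` are hypothetical and do not represent a real serialized sketch.

```python
def roundtrip_through_utf8_string(raw: bytes) -> bytes:
    """Decode bytes as UTF-8 text, then encode the text back to bytes.

    errors="replace" substitutes U+FFFD for invalid sequences. This is an
    assumption standing in for how a UTF-8 string type treats binary data.
    """
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Hypothetical binary payload: arbitrary bytes, NOT a real serialized sketch.
payload = bytes([0x02, 0x01, 0x07, 0x0C, 0xFF, 0x80, 0x00, 0xC3])

restored = roundtrip_through_utf8_string(payload)

print("original:", payload.hex())
print("restored:", restored.hex())
print("lossless:", restored == payload)
```

Pure ASCII input survives this round trip unchanged, so the problem only appears with binary data such as serialized sketches. That is consistent with the decision to keep sketch bytes out of string-typed Java UDF arguments and handle them on the BE side.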