perf : Optimize count distinct using bitmaps instead of hashsets for smaller datatypes by coderfender · Pull Request #21456 · apache/datafusion

coderfender · 2026-04-08T08:25:40Z

Which issue does this PR close?

Remove hashset-based accumulators for smaller integer data types and use bitmaps instead. Follow-up to #21453.

Closes #21488 (Use bitmap for count_distinct expression for u8/16 and i8/16 [perf])
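
As a rough illustration of the idea (not the code in this PR), a bitmap for 8-bit values could look like the sketch below. The names `U8Bitmap`, `count_distinct_bitmap`, and `count_distinct_hashset` are hypothetical and chosen only for this example; the actual accumulator in DataFusion may be structured differently.

```rust
use std::collections::HashSet;

/// Hypothetical 256-bit bitmap: one bit per possible 8-bit value.
#[derive(Default)]
struct U8Bitmap {
    words: [u64; 4], // 4 * 64 = 256 bits
}

impl U8Bitmap {
    /// Mark `v` as seen by setting its bit.
    fn insert(&mut self, v: u8) {
        self.words[(v >> 6) as usize] |= 1u64 << (v & 63);
    }

    /// Signed values reuse the same bitmap via a bit-preserving cast.
    fn insert_i8(&mut self, v: i8) {
        self.insert(v as u8);
    }

    /// Merging two partial states is a bitwise OR of the words.
    fn merge(&mut self, other: &U8Bitmap) {
        for (a, b) in self.words.iter_mut().zip(other.words.iter()) {
            *a |= *b;
        }
    }

    /// The distinct count is the number of set bits.
    fn count(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }
}

/// Bitmap-based distinct count over a slice.
fn count_distinct_bitmap(values: &[u8]) -> u32 {
    let mut bitmap = U8Bitmap::default();
    for &v in values {
        bitmap.insert(v);
    }
    bitmap.count()
}

/// Hash-set-based distinct count, for comparison.
fn count_distinct_hashset(values: &[u8]) -> usize {
    values.iter().copied().collect::<HashSet<u8>>().len()
}
```

Both functions return the same count for any input; the bitmap version avoids hashing and allocation entirely, which is presumably where the speedup comes from.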

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

coderfender · 2026-04-08T08:26:42Z

Benchmark results:

count_distinct i16 bitmap                      1.00      3.3±0.43µs        ? ?/sec    23.87    78.4±0.84µs        ? ?/sec
count_distinct i8 bitmap                       1.00      2.3±0.49µs        ? ?/sec    7.13     16.7±0.55µs        ? ?/sec
count_distinct u16 bitmap                      1.00      3.1±0.18µs        ? ?/sec    25.45    78.8±3.92µs        ? ?/sec
count_distinct u8 bitmap                       1.00      2.3±0.34µs        ? ?/sec    7.37     16.9±0.14µs        ? ?/sec

It seems like we are 25x faster for u16 bitmap based accumulators (or I am sleepy :) )

Dandandan · 2026-04-08T09:11:51Z

I think we can do the same for 16-bit types: it is just 65_536 bytes, or 8192 bytes if we use a bitmap.
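
The arithmetic behind those two numbers can be checked with a small sketch. The type aliases and the `U16Bitmap` struct are assumptions made for illustration, not names from this PR.

```rust
use std::mem::size_of;

/// Number of distinct values a 16-bit integer can take.
const DOMAIN: usize = 1 << 16; // 65_536

/// One byte per possible value.
type ByteFlags = [u8; DOMAIN];
/// One bit per possible value, packed into 64-bit words.
type BitFlags = [u64; DOMAIN / 64];

fn byte_flags_size() -> usize {
    size_of::<ByteFlags>()
}

fn bit_flags_size() -> usize {
    size_of::<BitFlags>()
}

/// Hypothetical bitmap covering the full u16 domain.
struct U16Bitmap {
    words: Box<BitFlags>,
}

impl U16Bitmap {
    fn new() -> Self {
        Self { words: Box::new([0u64; DOMAIN / 64]) }
    }

    /// Set the bit for `v`: the upper 10 bits pick the word, the lower 6 the bit.
    fn insert(&mut self, v: u16) {
        self.words[(v >> 6) as usize] |= 1u64 << (v & 63);
    }

    /// Count distinct values seen so far.
    fn count(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }
}
```

Here `byte_flags_size()` is 65_536 and `bit_flags_size()` is 8_192, matching the figures above: packing eight flags per byte divides the footprint by eight.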

Dandandan · 2026-04-08T09:12:46Z

Oh wait, you're already doing that :)

coderfender · 2026-04-08T21:38:57Z

Query 0 in the clickbench_extended dataset (which uses count distinct on u8) is now ~11% faster:

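┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃    main_cb ┃ bitmap_cb_2 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │  529.46 ms │   478.99 ms │ +1.11x faster │
└───────────┴────────────┴─────────────┴───────────────┘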

(Other queries are also faster, but I believe that is mostly variance.)

┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃    main_cb ┃ bitmap_cb_2 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │  529.46 ms │   478.99 ms │ +1.11x faster │
│ QQuery 1  │  107.43 ms │   102.59 ms │     no change │
│ QQuery 2  │  250.89 ms │   240.76 ms │     no change │
│ QQuery 3  │  207.67 ms │   207.49 ms │     no change │
│ QQuery 4  │  391.43 ms │   353.05 ms │ +1.11x faster │
│ QQuery 5  │ 4144.11 ms │  4084.08 ms │     no change │
│ QQuery 6  │  676.03 ms │   622.21 ms │ +1.09x faster │
│ QQuery 7  │  719.78 ms │   599.06 ms │ +1.20x faster │
│ QQuery 8  │  238.30 ms │   207.27 ms │ +1.15x faster │
│ QQuery 9  │ 1531.52 ms │  1406.34 ms │ +1.09x faster │
│ QQuery 10 │  435.27 ms │   403.28 ms │ +1.08x faster │
│ QQuery 11 │ 1043.68 ms │   955.22 ms │ +1.09x faster │
│ QQuery 12 │  113.16 ms │   106.31 ms │ +1.06x faster │
└───────────┴────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_cb)       │ 10388.73ms │
│ Total Time (bitmap_cb_2)   │  9766.66ms │
│ Average Time (main_cb)     │   799.13ms │
│ Average Time (bitmap_cb_2) │   751.28ms │
│ Queries Faster             │          9 │
│ Queries Slower             │          0 │
│ Queries with No Change     │          4 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘

coderfender · 2026-04-08T21:41:04Z

cc: @neilconway, @alamb, @martin-g. Please take a look whenever you get a chance.

alamb

This looks like a great idea. Thank you @coderfender

datafusion/functions-aggregate/Cargo.toml

datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs

alamb · 2026-04-09T20:25:40Z

run benchmark count_distinct

alamb · 2026-04-09T20:26:12Z

run benchmark clickbench

adriangbot · 2026-04-09T20:28:17Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4217267905-1036-wmjln 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing optimize_count_distinct (a13bcaa) to fbdf770 (merge-base) diff
BENCH_NAME=count_distinct
BENCH_COMMAND=cargo bench --features=parquet --bench count_distinct
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-04-09T20:28:18Z

Benchmark for this request failed.

Last 20 lines of output:

rustc 1.94.1 (e408947bf 2026-03-25)
a13bcaad5b82e69257e82b741cf72619da838990
fbdf7703a96408b4eba27801431be8bf468734d8
error: failed to load manifest for workspace member `/workspace/datafusion-branch/datafusion/catalog`
referenced by workspace at `/workspace/datafusion-branch/Cargo.toml`

Caused by:
  failed to load manifest for dependency `datafusion-datasource`

Caused by:
  failed to load manifest for dependency `datafusion-physical-plan`

Caused by:
  failed to load manifest for dependency `datafusion-functions-aggregate`

Caused by:
  failed to parse manifest at `/workspace/datafusion-branch/datafusion/functions-aggregate/Cargo.toml`

Caused by:
  found duplicate bench name count_distinct, but all bench targets must have a unique name

adriangbot · 2026-04-09T20:28:35Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4217270691-1037-8wdsm 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing optimize_count_distinct (a13bcaa) to fbdf770 (merge-base) diff
BENCH_NAME=clickbench
BENCH_COMMAND=cargo bench --features=parquet --bench clickbench
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-04-09T20:28:36Z

Benchmark for this request failed.

Last 20 lines of output:

rustc 1.94.1 (e408947bf 2026-03-25)
a13bcaad5b82e69257e82b741cf72619da838990
fbdf7703a96408b4eba27801431be8bf468734d8
error: failed to load manifest for workspace member `/workspace/datafusion-branch/datafusion/catalog`
referenced by workspace at `/workspace/datafusion-branch/Cargo.toml`

Caused by:
  failed to load manifest for dependency `datafusion-datasource`

Caused by:
  failed to load manifest for dependency `datafusion-physical-plan`

Caused by:
  failed to load manifest for dependency `datafusion-functions-aggregate`

Caused by:
  failed to parse manifest at `/workspace/datafusion-branch/datafusion/functions-aggregate/Cargo.toml`

Caused by:
  found duplicate bench name count_distinct, but all bench targets must have a unique name

coderfender · 2026-04-09T21:57:53Z

Updated benchmarks after rebasing on main.

Command:

cargo bench -p datafusion-functions-aggregate --bench count_distinct

group                                          bitmap_count_distinct                  main
-----                                          ---------------------                  ----
count_distinct i16 bitmap                      1.00      3.1±0.14µs        ? ?/sec    25.35    77.9±3.43µs        ? ?/sec
count_distinct i64 80% distinct                1.00     48.2±0.44µs        ? ?/sec    1.00     48.0±1.49µs        ? ?/sec
count_distinct i64 99% distinct                1.00     48.3±0.46µs        ? ?/sec    1.00     48.3±2.14µs        ? ?/sec
count_distinct i8 bitmap                       1.00      2.9±0.43µs        ? ?/sec    5.75     16.4±0.26µs        ? ?/sec
count_distinct u16 bitmap                      1.00      3.2±0.30µs        ? ?/sec    24.08    77.7±1.82µs        ? ?/sec
count_distinct u8 bitmap                       1.00      2.1±0.03µs        ? ?/sec    8.14     16.8±0.05µs        ? ?/sec

alamb · 2026-04-10T17:11:24Z

run benchmark count_distinct

alamb · 2026-04-10T17:12:08Z

run benchmark clickbench_partitioned

adriangbot · 2026-04-10T17:14:14Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4225463737-1057-kctwz 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing optimize_count_distinct (61dc8e1) to eaf0a41 (merge-base) diff
BENCH_NAME=count_distinct
BENCH_COMMAND=cargo bench --features=parquet --bench count_distinct
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-04-10T17:14:29Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4225467324-1058-p5tnm 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing optimize_count_distinct (61dc8e1) to eaf0a41 (merge-base) diff using: clickbench_partitioned
Results will be posted here when complete

adriangbot · 2026-04-10T17:19:48Z

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                              main                                   optimize_count_distinct
-----                              ----                                   -----------------------
count_distinct i16 bitmap          18.93   164.8±0.57µs        ? ?/sec    1.00      8.7±0.01µs        ? ?/sec
count_distinct i64 80% distinct    1.00    101.5±0.26µs        ? ?/sec    1.11    112.9±0.36µs        ? ?/sec
count_distinct i64 99% distinct    1.00    102.1±0.21µs        ? ?/sec    1.13    115.1±0.46µs        ? ?/sec
count_distinct i8 bitmap           5.34     31.1±0.07µs        ? ?/sec    1.00      5.8±0.00µs        ? ?/sec
count_distinct u16 bitmap          26.23   155.5±0.16µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
count_distinct u8 bitmap           5.29     30.9±0.05µs        ? ?/sec    1.00      5.8±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	59.6s
Peak memory	3.5 GiB
Avg memory	3.5 GiB
CPU user	74.1s
CPU sys	0.8s
Peak spill	0 B

branch

Metric	Value
Wall time	53.7s
Peak memory	3.5 GiB
Avg memory	3.5 GiB
CPU user	68.9s
CPU sys	0.2s
Peak spill	0 B

adriangbot · 2026-04-10T17:33:34Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and optimize_count_distinct
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃               optimize_count_distinct ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │          1.22 / 4.55 ±6.48 / 17.50 ms │          1.28 / 4.65 ±6.54 / 17.74 ms │     no change │
│ QQuery 1  │        14.60 / 14.84 ±0.23 / 15.13 ms │        14.29 / 14.96 ±0.40 / 15.44 ms │     no change │
│ QQuery 2  │        44.39 / 44.77 ±0.28 / 45.18 ms │        44.30 / 44.60 ±0.18 / 44.79 ms │     no change │
│ QQuery 3  │        41.90 / 44.66 ±3.02 / 49.28 ms │        42.28 / 45.70 ±2.09 / 48.23 ms │     no change │
│ QQuery 4  │     307.53 / 309.33 ±2.09 / 313.27 ms │     320.47 / 326.08 ±5.35 / 335.12 ms │  1.05x slower │
│ QQuery 5  │     360.58 / 364.09 ±3.61 / 370.91 ms │     370.97 / 377.62 ±6.57 / 389.19 ms │     no change │
│ QQuery 6  │           5.25 / 6.75 ±1.45 / 9.46 ms │           6.05 / 7.34 ±0.80 / 8.35 ms │  1.09x slower │
│ QQuery 7  │        17.25 / 17.98 ±0.60 / 18.67 ms │        17.23 / 17.43 ±0.17 / 17.73 ms │     no change │
│ QQuery 8  │    435.39 / 456.81 ±11.05 / 464.38 ms │    446.38 / 462.37 ±11.83 / 475.14 ms │     no change │
│ QQuery 9  │     691.06 / 700.98 ±8.21 / 715.87 ms │    697.99 / 713.75 ±13.43 / 733.65 ms │     no change │
│ QQuery 10 │       95.23 / 97.69 ±2.58 / 102.00 ms │        92.92 / 94.85 ±1.22 / 96.31 ms │     no change │
│ QQuery 11 │     108.16 / 109.61 ±1.52 / 112.36 ms │     105.24 / 106.92 ±1.62 / 109.54 ms │     no change │
│ QQuery 12 │     356.67 / 360.24 ±2.49 / 363.18 ms │     358.64 / 368.06 ±6.96 / 377.92 ms │     no change │
│ QQuery 13 │    469.38 / 488.74 ±11.95 / 506.64 ms │     481.77 / 497.10 ±9.45 / 507.12 ms │     no change │
│ QQuery 14 │     366.09 / 371.38 ±2.92 / 374.92 ms │     357.21 / 370.21 ±7.51 / 378.18 ms │     no change │
│ QQuery 15 │    374.55 / 391.21 ±14.55 / 414.89 ms │    379.14 / 390.51 ±15.65 / 421.55 ms │     no change │
│ QQuery 16 │    745.01 / 761.34 ±13.84 / 781.81 ms │    759.66 / 793.49 ±24.51 / 823.76 ms │     no change │
│ QQuery 17 │     742.67 / 751.26 ±8.84 / 767.18 ms │    736.39 / 756.75 ±13.28 / 776.65 ms │     no change │
│ QQuery 18 │ 1471.63 / 1546.60 ±59.45 / 1620.20 ms │ 1543.63 / 1555.39 ±11.24 / 1575.63 ms │     no change │
│ QQuery 19 │        36.61 / 41.25 ±5.69 / 51.88 ms │       37.86 / 45.14 ±12.97 / 71.04 ms │  1.09x slower │
│ QQuery 20 │    719.14 / 746.21 ±29.48 / 794.04 ms │    728.14 / 744.36 ±18.41 / 777.13 ms │     no change │
│ QQuery 21 │     764.18 / 769.47 ±4.64 / 776.91 ms │     775.44 / 778.80 ±1.96 / 780.65 ms │     no change │
│ QQuery 22 │ 1153.62 / 1163.52 ±11.47 / 1183.37 ms │  1151.37 / 1166.04 ±9.32 / 1180.51 ms │     no change │
│ QQuery 23 │ 3221.63 / 3258.32 ±30.91 / 3293.77 ms │ 3184.94 / 3216.27 ±25.60 / 3261.68 ms │     no change │
│ QQuery 24 │     104.85 / 109.52 ±3.23 / 114.73 ms │     104.04 / 106.71 ±2.00 / 110.25 ms │     no change │
│ QQuery 25 │     142.36 / 144.79 ±1.45 / 146.45 ms │     141.03 / 142.62 ±1.81 / 146.10 ms │     no change │
│ QQuery 26 │     105.19 / 106.60 ±1.12 / 107.64 ms │     104.56 / 106.88 ±1.73 / 109.03 ms │     no change │
│ QQuery 27 │     858.08 / 865.81 ±6.14 / 872.31 ms │     864.87 / 869.08 ±3.75 / 876.03 ms │     no change │
│ QQuery 28 │  3338.04 / 3351.97 ±9.33 / 3364.12 ms │  3351.19 / 3365.20 ±9.92 / 3380.25 ms │     no change │
│ QQuery 29 │        50.97 / 56.35 ±6.72 / 67.06 ms │        50.55 / 54.97 ±5.09 / 64.32 ms │     no change │
│ QQuery 30 │     379.31 / 384.28 ±5.65 / 391.26 ms │     369.92 / 377.62 ±5.07 / 384.23 ms │     no change │
│ QQuery 31 │    370.31 / 387.47 ±11.95 / 406.07 ms │    373.69 / 393.76 ±14.83 / 411.58 ms │     no change │
│ QQuery 32 │ 1261.98 / 1318.81 ±70.19 / 1447.44 ms │ 1098.34 / 1144.62 ±60.32 / 1255.35 ms │ +1.15x faster │
│ QQuery 33 │ 1649.15 / 1669.48 ±12.20 / 1687.23 ms │ 1567.62 / 1587.74 ±13.41 / 1603.87 ms │     no change │
│ QQuery 34 │ 1599.72 / 1665.75 ±35.21 / 1699.50 ms │  1601.02 / 1607.37 ±6.15 / 1617.20 ms │     no change │
│ QQuery 35 │     434.44 / 450.28 ±9.32 / 461.08 ms │    427.24 / 443.62 ±15.81 / 473.73 ms │     no change │
│ QQuery 36 │     120.65 / 126.30 ±3.77 / 132.00 ms │     125.79 / 129.21 ±2.08 / 132.23 ms │     no change │
│ QQuery 37 │        51.56 / 52.58 ±0.65 / 53.36 ms │        50.74 / 51.85 ±0.74 / 52.50 ms │     no change │
│ QQuery 38 │        77.27 / 79.31 ±1.26 / 81.10 ms │        79.46 / 81.01 ±1.42 / 83.01 ms │     no change │
│ QQuery 39 │     230.13 / 235.12 ±4.62 / 241.22 ms │     222.69 / 233.88 ±8.52 / 245.07 ms │     no change │
│ QQuery 40 │        25.33 / 26.24 ±0.95 / 27.69 ms │        26.20 / 27.64 ±1.43 / 30.08 ms │  1.05x slower │
│ QQuery 41 │        20.87 / 21.92 ±0.79 / 23.29 ms │        21.33 / 22.08 ±0.63 / 23.05 ms │     no change │
│ QQuery 42 │        20.22 / 21.21 ±0.62 / 21.88 ms │        20.90 / 21.16 ±0.27 / 21.50 ms │     no change │
└───────────┴───────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 23895.39ms │
│ Total Time (optimize_count_distinct)   │ 23665.40ms │
│ Average Time (HEAD)                    │   555.71ms │
│ Average Time (optimize_count_distinct) │   550.36ms │
│ Queries Faster                         │          1 │
│ Queries Slower                         │          4 │
│ Queries with No Change                 │         38 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘

Resource Usage

clickbench_partitioned — base (merge-base)

Metric	Value
Wall time	120.6s
Peak memory	37.7 GiB
Avg memory	27.3 GiB
CPU user	1120.8s
CPU sys	104.6s
Peak spill	0 B

clickbench_partitioned — branch

Metric	Value
Wall time	119.4s
Peak memory	37.9 GiB
Avg memory	27.8 GiB
CPU user	1123.7s
CPU sys	93.8s
Peak spill	0 B

alamb · 2026-04-10T18:32:58Z

run benchmark clickbench_extended

adriangbot · 2026-04-10T18:34:22Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4225909959-1060-zlkvt 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing optimize_count_distinct (61dc8e1) to eaf0a41 (merge-base) diff using: clickbench_extended
Results will be posted here when complete

adriangbot · 2026-04-10T18:52:01Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and optimize_count_distinct
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                     HEAD ┃                  optimize_count_distinct ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │       835.16 / 856.01 ±10.85 / 865.27 ms │       824.53 / 844.57 ±10.86 / 856.65 ms │     no change │
│ QQuery 1  │        213.67 / 214.24 ±0.47 / 214.99 ms │        208.28 / 209.91 ±1.90 / 213.53 ms │     no change │
│ QQuery 2  │        499.10 / 504.55 ±3.13 / 508.65 ms │        494.71 / 499.39 ±3.44 / 504.15 ms │     no change │
│ QQuery 3  │        315.60 / 317.07 ±0.86 / 318.24 ms │        314.72 / 315.84 ±0.65 / 316.68 ms │     no change │
│ QQuery 4  │       667.76 / 688.59 ±14.70 / 707.36 ms │       656.16 / 703.92 ±38.37 / 749.28 ms │     no change │
│ QQuery 5  │ 9631.38 / 10016.72 ±261.11 / 10359.86 ms │ 9874.75 / 10090.11 ±224.54 / 10371.86 ms │     no change │
│ QQuery 6  │    1005.02 / 1040.59 ±47.32 / 1129.82 ms │    1001.30 / 1010.79 ±16.05 / 1042.80 ms │     no change │
│ QQuery 7  │       884.26 / 934.13 ±28.89 / 968.79 ms │       812.89 / 921.70 ±56.94 / 980.85 ms │     no change │
│ QQuery 8  │        402.52 / 410.90 ±9.25 / 427.75 ms │        404.21 / 414.25 ±8.47 / 427.85 ms │     no change │
│ QQuery 9  │    2822.09 / 2856.82 ±28.90 / 2908.08 ms │    2910.47 / 2924.78 ±13.70 / 2947.79 ms │     no change │
│ QQuery 10 │        631.58 / 642.27 ±8.22 / 654.22 ms │       633.70 / 648.82 ±15.84 / 675.26 ms │     no change │
│ QQuery 11 │    2196.48 / 2215.66 ±11.80 / 2232.50 ms │    2028.49 / 2079.80 ±28.50 / 2114.52 ms │ +1.07x faster │
│ QQuery 12 │        205.89 / 210.51 ±3.93 / 217.34 ms │        201.22 / 209.79 ±7.45 / 221.54 ms │     no change │
└───────────┴──────────────────────────────────────────┴──────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 20908.07ms │
│ Total Time (optimize_count_distinct)   │ 20873.67ms │
│ Average Time (HEAD)                    │  1608.31ms │
│ Average Time (optimize_count_distinct) │  1605.67ms │
│ Queries Faster                         │          1 │
│ Queries Slower                         │          0 │
│ Queries with No Change                 │         12 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘

Resource Usage

clickbench_extended — base (merge-base)

Metric	Value
Wall time	105.4s
Peak memory	32.5 GiB
Avg memory	27.3 GiB
CPU user	1003.5s
CPU sys	54.9s
Peak spill	0 B

clickbench_extended — branch

Metric	Value
Wall time	105.2s
Peak memory	31.1 GiB
Avg memory	25.8 GiB
CPU user	1005.5s
CPU sys	57.2s
Peak spill	0 B

alamb · 2026-04-10T19:45:48Z

Query 0 in clickbench_extended dataset (which uses count distinct on u8 is now ~ 11 % faster :

The benchmark runner doesn't seem to be able to reproduce that result 🤔

Also, the microbenchmarks still seem to show a roughly 10% slowdown (#21456 (comment)) for i64, which isn't changed by this patch:

count_distinct i64 80% distinct    1.00    101.5±0.26µs        ? ?/sec    1.11    112.9±0.36µs        ? ?/sec
count_distinct i64 99% distinct    1.00    102.1±0.21µs        ? ?/sec    1.13    115.1±0.46µs        ? ?/sec

coderfender · 2026-04-10T19:49:22Z

@alamb, we added bitmaps only for the 8-bit and 16-bit datatypes. The benchmarks show a roughly 6x to 25x speedup compared to main:

count_distinct i16 bitmap                      1.00      3.1±0.14µs        ? ?/sec    25.35    77.9±3.43µs        ? ?/sec
count_distinct i8 bitmap                       1.00      2.9±0.43µs        ? ?/sec    5.75     16.4±0.26µs        ? ?/sec
count_distinct u16 bitmap                      1.00      3.2±0.30µs        ? ?/sec    24.08    77.7±1.82µs        ? ?/sec
count_distinct u8 bitmap                       1.00      2.1±0.03µs        ? ?/sec    8.14     16.8±0.05µs        ? ?/sec
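
For readers unfamiliar with the approach, the idea behind the bitmap accumulators can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Bitmap16` type and the `i16_slot` helper are hypothetical names. A 16-bit column has only 65,536 possible values, so one bit per value fits in 8 KiB, and recording a value is a single OR with no hashing.

```rust
/// Hypothetical fixed-size bitmap with one bit per possible 16-bit value.
/// 65_536 bits = 1024 u64 words = 8 KiB.
pub struct Bitmap16 {
    words: Box<[u64; 1024]>,
}

impl Bitmap16 {
    pub fn new() -> Self {
        Self { words: Box::new([0u64; 1024]) }
    }

    /// Mark `v` as seen: the high bits select the word, the low 6 bits select the bit.
    pub fn insert(&mut self, v: u16) {
        let idx = v as usize;
        self.words[idx >> 6] |= 1u64 << (idx & 63);
    }

    /// Merging two partial states is a word-wise OR.
    pub fn merge(&mut self, other: &Bitmap16) {
        for (a, b) in self.words.iter_mut().zip(other.words.iter()) {
            *a |= *b;
        }
    }

    /// The distinct count is the number of set bits.
    pub fn count(&self) -> u64 {
        self.words.iter().map(|w| u64::from(w.count_ones())).sum()
    }
}

/// Signed values can reuse the same bitmap by reinterpreting their bits.
pub fn i16_slot(v: i16) -> u16 {
    v as u16
}
```

The 8-bit case works the same way with 256 bits (four `u64` words), which is small enough to keep inline without a heap allocation.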

coderfender · 2026-04-10T21:13:28Z

Investigating why untouched paths are running slower

coderfender · 2026-04-10T21:38:23Z

Recent benchmarks (not sure whether the differences are due to variance in the build environment):

group                              bitmap_count_distinct                  main
-----                              ---------------------                  ----
count_distinct i16 bitmap          1.00      3.1±0.27µs        ? ?/sec    25.69    80.7±0.62µs        ? ?/sec
count_distinct i64 80% distinct    1.00     48.7±0.49µs        ? ?/sec    1.00     48.9±1.01µs        ? ?/sec
count_distinct i64 99% distinct    1.00     49.0±1.97µs        ? ?/sec    1.04     51.0±3.38µs        ? ?/sec
count_distinct i8 bitmap           1.00      2.2±0.18µs        ? ?/sec    7.60     17.0±0.16µs        ? ?/sec
count_distinct u16 bitmap          1.00      3.1±0.17µs        ? ?/sec    25.99    81.4±0.84µs        ? ?/sec
count_distinct u8 bitmap           1.00      2.0±0.01µs        ? ?/sec    8.42     17.2±0.17µs        ? ?/sec

alamb · 2026-04-10T21:56:53Z

run benchmark count_distinct

datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs

adriangbot · 2026-04-10T21:59:26Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4227023688-1062-bkn4c 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing optimize_count_distinct (61dc8e1) to eaf0a41 (merge-base) diff
BENCH_NAME=count_distinct
BENCH_COMMAND=cargo bench --features=parquet --bench count_distinct
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-04-10T22:03:00Z

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                              main                                   optimize_count_distinct
-----                              ----                                   -----------------------
count_distinct i16 bitmap          17.96   164.7±0.26µs        ? ?/sec    1.00      9.2±0.73µs        ? ?/sec
count_distinct i64 80% distinct    1.00    102.2±0.40µs        ? ?/sec    1.11    113.0±0.47µs        ? ?/sec
count_distinct i64 99% distinct    1.00    102.2±0.35µs        ? ?/sec    1.12    114.9±0.32µs        ? ?/sec
count_distinct i8 bitmap           5.32     31.1±0.11µs        ? ?/sec    1.00      5.8±0.00µs        ? ?/sec
count_distinct u16 bitmap          26.41   157.3±2.49µs        ? ?/sec    1.00      6.0±0.20µs        ? ?/sec
count_distinct u8 bitmap           5.22     30.9±0.03µs        ? ?/sec    1.00      5.9±0.07µs        ? ?/sec
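
To read the table: the leading number in each column appears to be that side's time relative to the faster side, so 1.00 marks the winner. A small sketch of that arithmetic, using the mean times printed above (the `relative` helper is a hypothetical name, and because the printed means are rounded, the results only approximately match the table):

```rust
/// Hypothetical helper: how many times slower `time_us` is than the
/// faster of the two sides being compared.
pub fn relative(time_us: f64, other_us: f64) -> f64 {
    time_us / time_us.min(other_us)
}
```

For the `i64 80% distinct` row, `relative(113.0, 102.2)` is about 1.11, an 11% slowdown on the branch. For `u16 bitmap`, `relative(157.3, 6.0)` is about 26, meaning main is roughly 26x slower than the branch.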

Resource Usage

base (merge-base)

Metric	Value
Wall time	59.8s
Peak memory	3.5 GiB
Avg memory	3.5 GiB
CPU user	74.0s
CPU sys	1.0s
Peak spill	0 B

branch

Metric	Value
Wall time	53.9s
Peak memory	3.5 GiB
Avg memory	3.5 GiB
CPU user	68.8s
CPU sys	0.2s
Peak spill	0 B

coderfender · 2026-04-10T22:10:31Z

Okay, this still seems to be an issue. Let me see whether adding compiler hints helps avoid regressing the existing i64 hot paths.
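
One kind of hint that could be tried here is an inlining attribute. The sketch below is an assumption about what such a change might look like, not the code in this PR: the function names are hypothetical, and whether `#[inline(never)]` actually removes the i64 regression would need to be confirmed by rerunning the benchmark. The idea is to keep the new small-integer path out of line so it is less likely to affect inlining decisions or code layout around the existing hash-set path.

```rust
use std::collections::HashSet;

/// Hypothetical u8 path: 256 bits held in four u64 words.
/// `#[inline(never)]` asks the compiler to keep this out of its callers.
#[inline(never)]
pub fn count_distinct_u8(values: &[u8]) -> usize {
    let mut seen = [0u64; 4];
    for &v in values {
        seen[usize::from(v >> 6)] |= 1u64 << (v & 63);
    }
    seen.iter().map(|w| w.count_ones() as usize).sum()
}

/// Hypothetical i64 path, standing in for the unchanged hash-set accumulator.
#[inline]
pub fn count_distinct_i64(values: &[i64]) -> usize {
    values.iter().copied().collect::<HashSet<i64>>().len()
}
```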

adriangbot · 2026-04-11T02:43:49Z

Hi @coderfender, thanks for the request (#21456 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas.

github-actions bot added the functions label (Changes to functions implementation) Apr 8, 2026

alamb added the performance label (Make DataFusion faster) Apr 9, 2026

alamb reviewed Apr 9, 2026

datafusion/functions-aggregate/Cargo.toml

alamb reviewed Apr 9, 2026

datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs (outdated)

coderfender mentioned this pull request Apr 9, 2026

chore: create benches small ints for count_distinct #21521

Merged

coderfender force-pushed the optimize_count_distinct branch from 93acd98 to 7e67e2e on April 9, 2026 18:53

coderfender added 7 commits April 9, 2026 14:33

bitmap_smaller_datatypes (8d49dfe)
bitmap_smaller_datatypes (0b179ff)
bitmap_instead_of_hll_smaller_datatypes (c6095ab)
bitmap_instead_of_hll_smaller_datatypes (9d06408)
bitmap_instead_of_hll_smaller_datatypes (f185fdc)
bitmap_instead_smaller_datatypes (f7c487a)
bitmap_instead_smaller_datatypes (3f091d9)

rebase_main (df90ef5)

coderfender force-pushed the optimize_count_distinct branch from 3db92e3 to df90ef5 on April 9, 2026 21:38

Merge branch 'main' into optimize_count_distinct (bf0f95c)
Merge branch 'main' into optimize_count_distinct (48a6029)

coderfender changed the title from "perf : Optimize count distinct" to "perf : Optimize count distinct using bitmaps instead of hashsets for smaller datatypes" Apr 9, 2026

Merge branch 'main' into optimize_count_distinct (61dc8e1)

alamb reviewed Apr 10, 2026

datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs (outdated)

coderfender added 4 commits April 10, 2026 15:20

remove_boxing_smaller_int_types (5a2918a)
Merge remote-tracking branch 'refs/remotes/origin/optimize_count_dist… …inct' into optimize_count_distinct (3f4952e)
remove_boxing_smaller_int_types (289b354)
never_inline_bitmap_accumulators (554f60c)
