chore: add count distinct group benchmarks #21575
coderfender wants to merge 6 commits into apache:main
@Dandandan, I plan to add benches to help better evaluate group accumulators, alongside running ClickBench / TPC-H queries directly, as group accumulators get implemented. Please take a look whenever you get a chance.
    if let Some(val) = arr.value(idx).into() {
        let single_val =
            Arc::new(Int64Array::from(vec![Some(val)])) as ArrayRef;
        accumulators[*group_idx]
            .update_batch(std::slice::from_ref(&single_val))
            .unwrap();
    }
-    if let Some(val) = arr.value(idx).into() {
-        let single_val =
-            Arc::new(Int64Array::from(vec![Some(val)])) as ArrayRef;
-        accumulators[*group_idx]
-            .update_batch(std::slice::from_ref(&single_val))
-            .unwrap();
-    }
+    let single_val = values.slice(idx, 1);
+    accumulators[*group_idx]
+        .update_batch(std::slice::from_ref(&single_val))
+        .unwrap();
Slightly simpler, and since slice is zero-copy it avoids building a new single-value array for each row.
Another way would be to collect the values for each group first and then build one array per group:
    let mut group_rows: Vec<Vec<i64>> = vec![Vec::new(); num_groups];
    for (idx, &group_idx) in group_indices.iter().enumerate() {
        if arr.is_valid(idx) {
            group_rows[group_idx].push(arr.value(idx));
        }
    }
    // Consume group_rows so each Vec moves into its array without a clone.
    for (group_idx, rows) in group_rows.into_iter().enumerate() {
        if !rows.is_empty() {
            let batch = Arc::new(Int64Array::from(rows)) as ArrayRef;
            accumulators[group_idx]
                .update_batch(std::slice::from_ref(&batch))
                .unwrap();
        }
    }
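The per-group batching idea can be tried out without Arrow or DataFusion. The following is a minimal sketch using only the standard library: `HashSet<i64>` stands in for a count-distinct accumulator, `Option<i64>` stands in for a nullable array slot, and the function name `distinct_counts_per_group` is hypothetical and not part of this PR.

```rust
use std::collections::HashSet;

/// Buckets the non-null values by group index, then hands each group a
/// single batch. `HashSet<i64>` is only a stand-in for a per-group
/// count-distinct accumulator (an assumption for illustration).
fn distinct_counts_per_group(
    values: &[Option<i64>],
    group_indices: &[usize],
    num_groups: usize,
) -> Vec<usize> {
    // Pass 1: collect the valid values that belong to each group.
    let mut group_rows: Vec<Vec<i64>> = vec![Vec::new(); num_groups];
    for (value, &group_idx) in values.iter().zip(group_indices) {
        if let Some(v) = value {
            group_rows[group_idx].push(*v);
        }
    }

    // Pass 2: one update per group instead of one update per row.
    let mut accumulators: Vec<HashSet<i64>> = vec![HashSet::new(); num_groups];
    for (group_idx, rows) in group_rows.into_iter().enumerate() {
        accumulators[group_idx].extend(rows);
    }

    accumulators.iter().map(|set| set.len()).collect()
}
```

The trade-off is an extra pass and temporary per-group buffers in exchange for far fewer accumulator calls, which matters when each call carries fixed overhead.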
    let arr = values.as_any().downcast_ref::<Int64Array>().unwrap();
    for (idx, group_idx) in group_indices.iter().enumerate() {
        if let Some(val) = arr.value(idx).into() {
Why do you need an Option here?
arr.value(idx) returns i64, and calling .into() on it always returns Some.
If you want to filter out the nulls, you need to check arr.is_null(idx) instead.
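To make the point concrete, here is a small standard-library sketch. `NullableI64` is a hypothetical stand-in for an Arrow `Int64Array` (raw values plus a validity mask), not a real Arrow type. It shows that converting an `i64` into `Option<i64>` never yields `None`, so only the validity mask can filter nulls.

```rust
/// Hypothetical stand-in for an Arrow Int64Array: raw values plus a
/// validity mask. A slot with `valid[i] == false` is null, and its raw
/// value is an arbitrary placeholder.
struct NullableI64 {
    values: Vec<i64>,
    valid: Vec<bool>,
}

/// Mirrors the pattern under review: `i64 -> Option<i64>` via `.into()`
/// always produces `Some`, so null slots are counted too.
fn count_with_into(arr: &NullableI64) -> usize {
    let mut n = 0;
    for idx in 0..arr.values.len() {
        let as_option: Option<i64> = arr.values[idx].into();
        if let Some(_val) = as_option {
            n += 1;
        }
    }
    n
}

/// Consults the validity mask, which is what actually skips nulls.
fn count_with_validity(arr: &NullableI64) -> usize {
    (0..arr.values.len()).filter(|&idx| arr.valid[idx]).count()
}

/// Three slots, the middle one null.
fn sample() -> NullableI64 {
    NullableI64 {
        values: vec![10, 0, 30],
        valid: vec![true, false, true],
    }
}
```

With the sample above, `count_with_into` sees all three slots while `count_with_validity` sees only the two non-null ones.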
    for (name, num_groups, distinct_pct, group_type) in scenarios {
        let n_distinct = BATCH_SIZE * distinct_pct / 100;
        let values = Arc::new(create_i64_array(n_distinct)) as ArrayRef;
        let group_indices = if group_type == "uniform" {
nit: Introduce an enum instead of using strings:
    enum GroupDist {
        Uniform,
        Skewed,
    }
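A rough sketch of how such an enum could drive group-index generation follows. The function name `group_indices` and, in particular, the skew rule (every row except each tenth one lands in group 0) are assumptions made for illustration; the PR's actual distributions are not shown in this excerpt.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum GroupDist {
    Uniform,
    Skewed,
}

/// Assigns each of `n_rows` rows to a group in `0..num_groups`.
/// Assumes `num_groups > 0`. The Skewed rule is purely illustrative:
/// every tenth row is spread round-robin, all others go to group 0.
fn group_indices(dist: GroupDist, n_rows: usize, num_groups: usize) -> Vec<usize> {
    (0..n_rows)
        .map(|row| match dist {
            GroupDist::Uniform => row % num_groups,
            GroupDist::Skewed => {
                if row % 10 == 0 {
                    row % num_groups
                } else {
                    0
                }
            }
        })
        .collect()
}
```

Matching on the enum lets the compiler reject a misspelled variant and flag a missing arm, neither of which a comparison like `group_type == "uniform"` can do.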
Which issue does this PR close?
Adds benchmarks for group accumulators to test #21561.
The implementation branches on the is_groups_accumulator_supported function call. Once this is merged, we should be able to evaluate group accumulators on the count distinct expression.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?