
GH-49985: [C++][Gandiva] Duplicate function aliases with same parameters#49986

Open: lriggs wants to merge 10 commits into apache:main from lriggs:dualAliasFix

@lriggs lriggs commented May 18, 2026

Rationale for this change

Silently allowing duplicate functions that differ only by return type makes the function registry confusing. Callers do not typically expect the return type to take part in choosing which function to call, so such registrations are ambiguous.

It is possible to register different Gandiva functions with the same alias and parameters but different return types, resulting in confusing function overloads.

For example, DATE_EXTRACTION_TRUNCATION_FNS in [cpp/src/gandiva/function_registry_datetime.cc] was invoked twice with the same SQL alias lists — once for extract* (returns int64) and once for date_trunc_* (returns the input date/timestamp type):


DATE_EXTRACTION_TRUNCATION_FNS(EXTRACT_SAFE_NULL_IF_NULL, extract)
DATE_EXTRACTION_TRUNCATION_FNS(TRUNCATE_SAFE_NULL_IF_NULL, date_trunc_)

As a result the registry contained four entries for day(...) where there should have been two:

int64     day(timestamp) → extractDay_timestamp
int64     day(date)      → extractDay_date64
timestamp day(timestamp) → date_trunc_Day_timestamp
date      day(date)      → date_trunc_Day_date64

The same problem existed for every calendar-unit alias: year, month, quarter, week, weekofyear, yearweek, dayofmonth, hour, minute, second. Resolution behavior depended on the caller's inferred return type, which is not the SQL semantics anyone expects from day(timestamp_col).

FunctionRegistry::Add was silently allowing these registrations: unordered_map::emplace keeps the first entry and discards subsequent ones with no warning.
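
The silent-shadowing behavior follows directly from the standard library contract of emplace. As a minimal sketch (the helper name EmplaceTwice and the string-keyed map are illustrative assumptions, not the Gandiva types), a second emplace with an existing key leaves the first value in place and only signals the rejection through its return value:

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: tries to register `first` and then `second` under the
// same key, and reports which value survived plus whether the second
// insertion actually happened.
std::pair<std::string, bool> EmplaceTwice(const std::string& key,
                                          const std::string& first,
                                          const std::string& second) {
  std::unordered_map<std::string, std::string> registry;
  registry.emplace(key, first);
  // The key already exists, so nothing is inserted. emplace returns
  // {iterator to the existing element, false} and does not throw or log.
  auto result = registry.emplace(key, second);
  return {result.first->second, result.second};
}
```

Unless the caller inspects the returned bool, the dropped registration goes unnoticed, which is the gap the new diagnostics are meant to close.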

What changes are included in this PR?

Diagnostics

Added two ARROW_LOG(ERROR) checks inside FunctionRegistry::Add:
  • Duplicate-signature check: when pc_registry_map_.emplace reports the entry already existed (same name + params + return type), log the conflict including both pc_names.
  • Alias-collision check: a new call_shape_map_ keyed on (lower(name), param_type_ids) (return type excluded) detects the case where the same name(args) could resolve to two functions with different return types. This is the check that catches the day family.
These run at registry-construction time, so future regressions surface in test output rather than being silently shadowed.
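
To make the two checks concrete, here is a simplified sketch of the idea. It is not the code in this PR: ToySignature, ToyRegistry, CallShapeKey and the string-based type names are all hypothetical stand-ins, and diagnostics are returned as strings rather than written with ARROW_LOG.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified signature: types are plain strings.
struct ToySignature {
  std::string name;
  std::vector<std::string> params;
  std::string ret;
};

std::string Lower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Key that ignores the return type: lower(name) plus the parameter types.
std::string CallShapeKey(const ToySignature& sig) {
  std::string key = Lower(sig.name) + "(";
  for (std::size_t i = 0; i < sig.params.size(); ++i) {
    if (i > 0) key += ",";
    key += sig.params[i];
  }
  return key + ")";
}

class ToyRegistry {
 public:
  // Returns the diagnostics raised by this registration (empty when clean).
  std::vector<std::string> Add(const ToySignature& sig, const std::string& pc_name) {
    std::vector<std::string> diagnostics;
    const std::string shape = CallShapeKey(sig);
    const std::string full = sig.ret + " " + shape;

    // Check 1: the exact same signature was registered before.
    auto full_result = by_full_signature_.emplace(full, pc_name);
    if (!full_result.second) {
      diagnostics.push_back("duplicate signature " + full + ": " +
                            full_result.first->second + " vs " + pc_name);
      return diagnostics;
    }
    // Check 2: same name(args) already registered with another return type.
    auto shape_result = by_call_shape_.emplace(shape, sig.ret);
    if (!shape_result.second && shape_result.first->second != sig.ret) {
      diagnostics.push_back("alias collision on " + shape + ": returns " +
                            shape_result.first->second + " and " + sig.ret);
    }
    return diagnostics;
  }

 private:
  std::unordered_map<std::string, std::string> by_full_signature_;
  std::unordered_map<std::string, std::string> by_call_shape_;
};
```

In this sketch, registering int64 day(timestamp) and then timestamp day(timestamp) yields one alias-collision diagnostic, while registering int64 day(timestamp) twice yields one duplicate-signature diagnostic.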

Macro split

DATE_EXTRACTION_TRUNCATION_FNS is split into two macros:

  • DATE_EXTRACTION_FNS: keeps the existing SQL alias lists ({"year"}, {"day", "dayofmonth"}, etc.).
  • DATE_TRUNCATION_FNS: every alias list is {}. The truncate functions are still reachable through their date_trunc_* base names.
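
The effect of the split can be illustrated with a pair of toy macros. Everything below is a hypothetical reduction for illustration (ToyEntry, TOY_EXTRACTION_FNS, TOY_TRUNCATION_FNS and BuildToyEntries are not the real Gandiva macros, which also take an inner registration macro and cover more units); it only shows how one macro keeps the alias lists while the other registers base names alone.

```cpp
#include <string>
#include <vector>

// Hypothetical registry entry: a base name plus its SQL aliases.
struct ToyEntry {
  std::string base_name;
  std::vector<std::string> aliases;
};

// Extraction variants keep the SQL aliases.
#define TOY_EXTRACTION_FNS(PREFIX)            \
  ToyEntry{#PREFIX "Year", {"year"}},         \
  ToyEntry{#PREFIX "Day", {"day", "dayofmonth"}}

// Truncation variants register with empty alias lists, so they are only
// reachable through their base names.
#define TOY_TRUNCATION_FNS(PREFIX)            \
  ToyEntry{#PREFIX "Year", {}},               \
  ToyEntry{#PREFIX "Day", {}}

std::vector<ToyEntry> BuildToyEntries() {
  // #PREFIX stringifies the argument; adjacent string literals concatenate,
  // giving names such as "extractDay" and "date_trunc_Day".
  return {TOY_EXTRACTION_FNS(extract), TOY_TRUNCATION_FNS(date_trunc_)};
}
```

With this shape, only the extraction entries contribute the day alias, so day(timestamp) can map to a single return type.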

Strengthened registry test

Added a third loop to TestNoDuplicates that mirrors the runtime alias-collision check: keys each registered signature on (lower(name), param_types) only and fails if two signatures share that key but disagree on return type. The two pre-existing loops still cover their original cases (same pc_name + params + ret, and same full signature appearing twice).
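
The shape of that third loop is roughly the following. This is a standalone sketch under assumed names (RegisteredSig and FindReturnTypeConflicts are hypothetical; the real test iterates the registry's signatures and uses gtest assertions), showing only the keying on (lower(name), param_types) and the return-type comparison:

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical flattened view of one registered signature.
struct RegisteredSig {
  std::string name;
  std::vector<std::string> params;
  std::string ret;
};

// Returns every call shape (lowercased name plus parameter types) that is
// registered with more than one return type, in order of detection.
std::vector<std::string> FindReturnTypeConflicts(const std::vector<RegisteredSig>& sigs) {
  std::map<std::string, std::string> first_ret_by_shape;
  std::vector<std::string> conflicts;
  for (const auto& sig : sigs) {
    std::string key = sig.name;
    std::transform(key.begin(), key.end(), key.begin(), [](unsigned char c) {
      return static_cast<char>(std::tolower(c));
    });
    key += "(";
    for (std::size_t i = 0; i < sig.params.size(); ++i) {
      if (i > 0) key += ",";
      key += sig.params[i];
    }
    key += ")";
    // Remember the first return type seen for this shape; flag any later
    // signature with the same shape but a different return type.
    auto result = first_ret_by_shape.emplace(key, sig.ret);
    if (!result.second && result.first->second != sig.ret) {
      conflicts.push_back(key);
    }
  }
  return conflicts;
}
```

A test built on this would fail whenever the returned list is non-empty, for example when both the extract and truncate variants are registered under day.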

Cleanup

Removed the standalone NativeFunction("add", ..., date64+int64 → timestamp, "add_date64_int64"). DATE_ADD_FNS(add, {}) already provides the entry with the correct → date64 return type matching the precompiled symbol.

Dropped the {"convert_fromutf8"} and {"convert_replaceutf8"} aliases; the base names already match case-insensitively.
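
Why those aliases were redundant can be shown with a small sketch. It assumes lookup keys are the lowercased names, and the base-name spelling convert_fromUTF8 is used purely for illustration; DistinctLookupKeys is a hypothetical helper, not a Gandiva function.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

std::string ToLowerAscii(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Counts how many distinct case-insensitive lookup keys a base name and its
// aliases produce. An alias that differs from the base name only in letter
// case adds no new key.
std::size_t DistinctLookupKeys(const std::string& base_name,
                               const std::vector<std::string>& aliases) {
  std::set<std::string> keys;
  keys.insert(ToLowerAscii(base_name));
  for (const auto& alias : aliases) {
    keys.insert(ToLowerAscii(alias));
  }
  return keys.size();
}
```

Under that assumption, a base name with a case-only alias yields a single key, whereas a base name such as extractDay with aliases day and dayofmonth yields three.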

Are these changes tested?

Registry construction is now silent — no duplicate or alias-collision logs.
TestFunctionRegistry — 6/6 pass, including the strengthened TestNoDuplicates.
ctest -R gandiva — 4/4 binaries pass (gandiva-internals-test, gandiva-precompiled-test, gandiva-projector-test, gandiva-projector-test-static).

Are there any user-facing changes?

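Yes. The Gandiva function list changes: the truncation functions no longer register the calendar-unit SQL aliases (year, month, day, etc.), the redundant convert_fromutf8 and convert_replaceutf8 aliases are dropped, and the standalone add(date64, int64) → timestamp entry is removed. The underlying functionality is still available through the base names (for example date_trunc_*).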

@github-actions

⚠️ GitHub issue #49985 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-49985: Duplicate function aliases with same parameters GH-49985: [C++][Gandiva] Duplicate function aliases with same parameters May 25, 2026
@kou kou requested a review from Copilot May 25, 2026 21:10

Copilot AI left a comment

Pull request overview

This PR improves the Gandiva C++ function registry’s diagnostics and correctness by detecting (and preventing regressions of) ambiguous function registrations where the same name(args) can map to different return types. It also removes/adjusts a few aliases and signatures that previously caused confusing overloads.

Changes:

  • Added call-shape collision detection (name + param types, ignoring return type) and duplicate-signature logging during FunctionRegistry::Add.
  • Split the date extraction vs truncation registration macros to avoid SQL-alias collisions (e.g., day(timestamp) no longer ambiguously maps to both extract and trunc variants).
  • Strengthened TestNoDuplicates and adjusted or removed specific problematic registrations/aliases (e.g., redundant UTF8 aliases, incorrect add_date64_int64 registration).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

File | Description
cpp/src/gandiva/precompiled/types.h | Aligns a few precompiled stub signatures with intended date/timestamp return types.
cpp/src/gandiva/function_signature.h | Adds CallShape() API to compute a return-type-agnostic signature key.
cpp/src/gandiva/function_signature.cc | Implements CallShape() with decimal handling consistent with signature identity rules.
cpp/src/gandiva/function_registry.h | Adds call_shape_map_ member to track and detect call-shape collisions.
cpp/src/gandiva/function_registry.cc | Logs duplicate signatures and logs call-shape collisions during registry construction.
cpp/src/gandiva/function_registry_timestamp_arithmetic.cc | Removes a redundant/incorrect add(date64, int64) -> timestamp registration.
cpp/src/gandiva/function_registry_test.cc | Adds focused tests for new diagnostics and strengthens the no-duplicates test.
cpp/src/gandiva/function_registry_string.cc | Removes redundant UTF8 aliases that were ineffective under case-insensitive lookup.
cpp/src/gandiva/function_registry_datetime.cc | Splits extraction vs truncation macros so truncation functions don’t register conflicting SQL aliases.

Comment on lines 85 to 92
 private:
  std::vector<NativeFunction> pc_registry_;
  SignatureMap pc_registry_map_;
  // Tracks name+param-types (return type ignored) to detect call-shape collisions
  // where the same `name(args)` could resolve to two functions with different
  // return types.
  std::unordered_map<std::string, const FunctionSignature*> call_shape_map_;
  std::vector<std::shared_ptr<arrow::Buffer>> bitcode_memory_buffers_;