GH-49985: [C++][Gandiva] Duplicate function aliases with same parameters#49986
Open
lriggs wants to merge 10 commits into
Open
GH-49985: [C++][Gandiva] Duplicate function aliases with same parameters#49986lriggs wants to merge 10 commits into
lriggs wants to merge 10 commits into
Conversation
|
|
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves the Gandiva C++ function registry’s diagnostics and correctness by detecting (and preventing regressions of) ambiguous function registrations where the same name(args) can map to different return types. It also removes/adjusts a few aliases and signatures that previously caused confusing overloads.
Changes:
- Added call-shape collision detection (name + param types, ignoring return type) and duplicate-signature logging during
FunctionRegistry::Add. - Split the date extraction vs truncation registration macros to avoid SQL-alias collisions (e.g.,
day(timestamp)no longer ambiguously maps to both extract and trunc variants). - Strengthened
TestNoDuplicatesand adjusted/removes specific problematic registrations/aliases (e.g., redundant UTF8 aliases, incorrectadd_date64_int64registration).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| cpp/src/gandiva/precompiled/types.h | Aligns a few precompiled stub signatures with intended date/timestamp return types. |
| cpp/src/gandiva/function_signature.h | Adds CallShape() API to compute a return-type-agnostic signature key. |
| cpp/src/gandiva/function_signature.cc | Implements CallShape() with decimal handling consistent with signature identity rules. |
| cpp/src/gandiva/function_registry.h | Adds call_shape_map_ member to track and detect call-shape collisions. |
| cpp/src/gandiva/function_registry.cc | Logs duplicate signatures and logs call-shape collisions during registry construction. |
| cpp/src/gandiva/function_registry_timestamp_arithmetic.cc | Removes a redundant/incorrect add(date64, int64) -> timestamp registration. |
| cpp/src/gandiva/function_registry_test.cc | Adds focused tests for new diagnostics and strengthens the no-duplicates test. |
| cpp/src/gandiva/function_registry_string.cc | Removes redundant UTF8 aliases that were ineffective under case-insensitive lookup. |
| cpp/src/gandiva/function_registry_datetime.cc | Splits extraction vs truncation macros so truncation functions don’t register conflicting SQL aliases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
85
to
92
| private: | ||
| std::vector<NativeFunction> pc_registry_; | ||
| SignatureMap pc_registry_map_; | ||
| // Tracks name+param-types (return type ignored) to detect call-shape collisions | ||
| // where the same `name(args)` could resolve to two functions with different | ||
| // return types. | ||
| std::unordered_map<std::string, const FunctionSignature*> call_shape_map_; | ||
| std::vector<std::shared_ptr<arrow::Buffer>> bitcode_memory_buffers_; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Silently allowing duplicate functions that only differ by return type provides a confusing function registry. Callers do not typically expect the return type to be part of determining which function to call leading to ambiguity.
It is possible to register different Gandiva functions with the same alias and parameters but different return types, resulting in confusing function overloads.
For example, DATE_EXTRACTION_TRUNCATION_FNS in [cpp/src/gandiva/function_registry_datetime.cc] was invoked twice with the same SQL alias lists — once for extract* (returns int64) and once for date_trunc_* (returns the input date/timestamp type):
As a result the registry contained four entries for day(...) where there should have been two:
The same problem existed for every calendar-unit alias: year, month, quarter, week, weekofyear, yearweek, dayofmonth, hour, minute, second. Resolution behavior depended on the caller's inferred return type, which is not the SQL semantics anyone expects from day(timestamp_col).
FunctionRegistry::Add was silently allowing these registrations: unordered_map::emplace keeps the first entry and discards subsequent ones with no warning.
What changes are included in this PR?
Diagnostics
Added two ARROW_LOG(ERROR) checks inside FunctionRegistry::Add:
Duplicate-signature check — when pc_registry_map_.emplace reports the entry already existed (same name + params + return type), log the conflict including both pc_names.
Alias-collision check — a new call_shape_map_ keyed on (lower(name), param_type_ids) (return type excluded) detects the case where the same name(args) could resolve to two functions with different return types. This is the check that catches the day family.
These run at registry-construction time, so future regressions surface in test output rather than being silently shadowed.
Macro split
DATE_EXTRACTION_TRUNCATION_FNS is split into two macros:
DATE_EXTRACTION_FNS — keeps the existing SQL alias lists ({"year"}, {"day", "dayofmonth"}, etc.).
DATE_TRUNCATION_FNS — every alias list is {}. The truncate functions are still reachable through their date_trunc_* base names.
Strengthened registry test
Added a third loop to TestNoDuplicates that mirrors the runtime alias-collision check: keys each registered signature on (lower(name), param_types) only and fails if two signatures share that key but disagree on return type. The two pre-existing loops still cover their original cases (same pc_name + params + ret, and same full signature appearing twice).
Cleanup
Removed the standalone NativeFunction("add", ..., date64+int64 → timestamp, "add_date64_int64"). DATE_ADD_FNS(add, {}) already provides the entry with the correct → date64 return type matching the precompiled symbol.
Dropped the {"convert_fromutf8"} and {"convert_replaceutf8"} aliases; the base names already match case-insensitively.
Are these changes tested?
Registry construction is now silent — no duplicate or alias-collision logs.
TestFunctionRegistry — 6/6 pass, including the strengthened TestNoDuplicates.
ctest -R gandiva — 4/4 binaries pass (gandiva-internals-test, gandiva-precompiled-test, gandiva-projector-test, gandiva-projector-test-static).
Are there any user-facing changes?
The Gandiva function list is different, with some aliases no longer appearing. Functionality is still available.