[Evaluation] Enable ToolCallAccuracy/Input/Success on restricted-tool conversations#47369
Open
mmkawale wants to merge 2 commits into
These three evaluators grade the agent's tool selection, input arguments,
and call status -- none consume the (redacted) tool output body -- so the
previous unconditional rejection of conversations containing built-in
restricted tools (bing_grounding, bing_custom_search, azure_ai_search,
azure_fabric, sharepoint_grounding) is now lifted.
Implementation:
- Set check_for_unsupported_tools=False on each evaluator's input validator
  in _tool_call_accuracy.py, _tool_input_accuracy.py, _tool_call_success.py.
- The underlying ToolDefinitionsValidator / ToolCallsValidator classes are
  unchanged; GroundednessEvaluator and ToolOutputUtilizationEvaluator still
  reject restricted tools because they require the tool output body.
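The flag flip described above can be pictured with a minimal sketch. The class, exception, and payload shape below are hypothetical stand-ins written for illustration; they are not the SDK's ToolCallsValidator or EvaluationException, and only the flag name and the restricted tool names come from this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Restricted built-in tool names, as listed in the PR description.
RESTRICTED_TOOL_NAMES = frozenset(
    {
        "bing_grounding",
        "bing_custom_search",
        "azure_ai_search",
        "azure_fabric",
        "sharepoint_grounding",
    }
)


class UnsupportedToolError(ValueError):
    """Hypothetical stand-in for the SDK's EvaluationException."""


@dataclass
class SketchToolCallsValidator:
    """Hypothetical, simplified validator; not the real SDK class."""

    # Rejection stays the default, mirroring "default for other callers".
    check_for_unsupported_tools: bool = True

    def validate_eval_input(self, eval_input: Dict[str, Any]) -> bool:
        # Assumed payload shape: a list of messages whose "content" is a
        # list of dict items, some of which have type "tool_call".
        if self.check_for_unsupported_tools:
            for message in eval_input.get("response", []):
                for item in message.get("content", []):
                    if (
                        item.get("type") == "tool_call"
                        and item.get("name") in RESTRICTED_TOOL_NAMES
                    ):
                        raise UnsupportedToolError(
                            f"Restricted tool not supported: {item['name']}"
                        )
        return True
```

In this sketch an evaluator that never reads the tool output body would construct the validator with check_for_unsupported_tools=False, while any other caller keeps the default and still gets the rejection.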
Tests:
- New test_unsupported_tools_validation.py (26 tests) covers:
  * 15 parametrized cases: each of the 3 evaluators x 5 restricted tools,
    asserting validate_eval_input returns True for response= payloads.
  * 1 mixed-tools case.
  * 10 regression cases asserting the underlying validators still reject
    restricted tools when check_for_unsupported_tools=True.
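As a rough illustration of where the 15 parametrized cases come from, the evaluator-by-tool matrix can be built as a Cartesian product. This is a sketch only: the helper name is hypothetical and the actual test file may construct its cases differently.

```python
from itertools import product
from typing import List, Tuple

# Evaluator and tool names as given in the PR description.
EVALUATOR_NAMES = [
    "ToolCallAccuracyEvaluator",
    "_ToolInputAccuracyEvaluator",
    "_ToolCallSuccessEvaluator",
]
RESTRICTED_TOOL_NAMES = [
    "bing_grounding",
    "bing_custom_search",
    "azure_ai_search",
    "azure_fabric",
    "sharepoint_grounding",
]


def build_acceptance_cases() -> List[Tuple[str, str]]:
    """Hypothetical helper: one (evaluator, tool) pair per acceptance case."""
    return list(product(EVALUATOR_NAMES, RESTRICTED_TOOL_NAMES))


cases = build_acceptance_cases()
print(len(cases))  # 15
```

A list of pairs like this could be passed to pytest.mark.parametrize("evaluator_name,tool_name", cases) so that each combination is reported as its own test.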
Versioning:
- Bumped _version.py 1.17.0 -> 1.17.1.
- Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added.
Contributor
Thank you for your contribution @mmkawale! We will review the pull request and get back to you soon.
Contributor
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates tool-related evaluators to allow conversations that include restricted built-in tools by disabling unsupported-tool checks in their input validators, and adds regression tests to ensure the relaxed behavior is limited to those evaluators.
Changes:
- Set check_for_unsupported_tools=False for ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, and _ToolCallSuccessEvaluator validators.
- Added unit tests covering acceptance of restricted tools for those evaluators and continued rejection when validator flags are enabled.
- Bumped package version and documented the behavior change in the changelog.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Disables unsupported-tool checking in ToolCallsValidator wiring. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py | Disables unsupported-tool checking in ToolDefinitionsValidator wiring. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/_tool_call_success.py | Disables unsupported-tool checking in ToolDefinitionsValidator wiring. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_unsupported_tools_validation.py | Adds regression tests ensuring restricted tools are accepted only where intended. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py | Bumps version to 1.17.1. |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Documents the new behavior under 1.17.1 (Unreleased). |
Comment on lines +83 to +84

    # Should not raise EvaluationException; flag flip made this path legal.
    assert evaluator._validator.validate_eval_input(eval_input) is True

Comment on lines +70 to +73

    @pytest.mark.usefixtures("mock_model_config")
    @pytest.mark.unittest
    class TestRestrictedToolValidationLifted:
        """Validator should no longer reject restricted tools for these three evaluators."""

Comment on lines +59 to +67

    def _restricted_tool_definition(tool_name: str):
        return {
            "name": tool_name,
            "description": f"Built-in {tool_name} tool.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
            },
        }

Comment on lines +34 to +40

    RESTRICTED_TOOL_NAMES = [
        "bing_grounding",
        "bing_custom_search",
        "azure_ai_search",
        "azure_fabric",
        "sharepoint_grounding",
    ]

When any tool_call or tool_result in the response carries a known-failure status (failed, error, incomplete, cancelled/canceled), short-circuit _do_eval to return a deterministic fail result (score=0, _passed=False, _result='fail') without invoking the LLM. The evaluator's scoring contract is explicitly binary -- 'FALSE: at least one tool call failed' -- and the prompty rubric does not consider the status field, so it would otherwise grade only the (typically empty) result body and frequently mis-score failed conversations as passes. Reuses the existing pre-flow short-circuit pattern (_is_intermediate_response / _return_not_applicable_result) for consistency. Status is only populated by upstream converters that preserve it; absent status, behavior is unchanged. Bumps to 1.17.1, adds CHANGELOG entry, and adds 19 focused unit tests.
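The short-circuit described in this commit message might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the message shape, and the result keys are hypothetical and do not reproduce the evaluator's actual _do_eval; only the list of failure statuses and the score=0 / fail outcome come from the text above.

```python
from typing import Any, Callable, Dict, List, Optional

# Known-failure statuses named in the commit message.
FAILURE_STATUSES = frozenset({"failed", "error", "incomplete", "cancelled", "canceled"})

Message = Dict[str, Any]


def find_failed_status(response: List[Message]) -> Optional[str]:
    """Return the first known-failure status on a tool_call/tool_result, else None."""
    for message in response:
        content = message.get("content", [])
        if not isinstance(content, list):
            continue  # plain-string content carries no tool items
        for item in content:
            if not isinstance(item, dict):
                continue
            if item.get("type") in ("tool_call", "tool_result"):
                status = item.get("status")
                if isinstance(status, str) and status.lower() in FAILURE_STATUSES:
                    return status.lower()
    return None


def evaluate_tool_call_success(
    response: List[Message],
    llm_grader: Callable[[List[Message]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical wrapper: skip the LLM when a failure status is present."""
    failed = find_failed_status(response)
    if failed is not None:
        # Deterministic fail result; the key names are illustrative only.
        return {
            "score": 0,
            "passed": False,
            "result": "fail",
            "reason": f"Tool call reported status '{failed}'.",
        }
    # No status, or no failure status: behavior is unchanged and the LLM grades.
    return llm_grader(response)
```

Checking the status before any model call keeps the result deterministic for failed conversations, and leaves conversations without a status field on the existing LLM path.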
Summary
Enable ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, and _ToolCallSuccessEvaluator on conversations that contain built-in restricted tools (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding).
These three evaluators grade the agent's tool selection, input arguments, and call status -- none of them consume the (redacted) tool output body -- so the previous blanket rejection was overly conservative.
GroundednessEvaluator and ToolOutputUtilizationEvaluator still reject restricted tools because they require the tool output body.
Implementation
- Set check_for_unsupported_tools=False on each evaluator's input validator in:
  * _tool_call_accuracy.py
  * _tool_input_accuracy.py
  * _tool_call_success.py
- The underlying ToolDefinitionsValidator / ToolCallsValidator classes are unchanged; restricted-tool rejection remains the default for any other caller.
Tests
New test_unsupported_tools_validation.py (26 tests, all passing):
- 15 parametrized cases (each of the 3 evaluators x 5 restricted tools) asserting validate_eval_input returns True for response= payloads.
- 1 mixed-tools case.
- 10 regression cases asserting the underlying validators still reject restricted tools when check_for_unsupported_tools=True (default for other callers).
Versioning
- Bumped _version.py 1.17.0 -> 1.17.1.
- Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added.
Related
Parallel change in azureml-assets registry (separate PR) bumps the published evaluator versions so the ACA / Foundry batch path picks up the same behavior.