[Evaluation] Enable ToolCallAccuracy/Input/Success on restricted-tool conversations by mmkawale · Pull Request #47369 · Azure/azure-sdk-for-python

mmkawale · 2026-06-05T17:08:39Z

Summary

Enable ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, and _ToolCallSuccessEvaluator on conversations that contain built-in restricted tools (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding).

These three evaluators grade the agent's tool selection, input arguments, and call status — none of them consume the (redacted) tool output body — so the previous blanket rejection was overly conservative. GroundednessEvaluator and ToolOutputUtilizationEvaluator still reject restricted tools because they require the tool output body.

Implementation

Set check_for_unsupported_tools=False on each evaluator's input validator in:
- _tool_call_accuracy.py
- _tool_input_accuracy.py
- _tool_call_success.py
The underlying ToolDefinitionsValidator / ToolCallsValidator classes are unchanged; restricted-tool rejection remains the default for any other caller.

Tests

New test_unsupported_tools_validation.py (26 tests, all passing):

15 parametrized cases — each of the 3 evaluators x 5 restricted tools, asserting validate_eval_input returns True for response= payloads.
1 mixed-tools case.
10 regression cases asserting the underlying validators still reject restricted tools when check_for_unsupported_tools=True (default for other callers).

Versioning

Bumped _version.py 1.17.0 -> 1.17.1.
Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added.

These three evaluators grade the agent's tool selection, input arguments, and call status -- none consume the (redacted) tool output body -- so the previous unconditional rejection of conversations containing built-in restricted tools (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding) is now lifted. Implementation: - Set check_for_unsupported_tools=False on each evaluator's input validator in _tool_call_accuracy.py, _tool_input_accuracy.py, _tool_call_success.py. - The underlying ToolDefinitionsValidator / ToolCallsValidator classes are unchanged; GroundednessEvaluator and ToolOutputUtilizationEvaluator still reject restricted tools because they require the tool output body. Tests: - New test_unsupported_tools_validation.py (26 tests) covers: * 15 parametrized cases: each of the 3 evaluators x 5 restricted tools, asserting validate_eval_input returns True for response= payloads. * 1 mixed-tools case. * 10 regression cases asserting the underlying validators still reject restricted tools when check_for_unsupported_tools=True. Versioning: - Bumped _version.py 1.17.0 -> 1.17.1. - Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added.

github-actions · 2026-06-05T17:09:07Z

Thank you for your contribution @mmkawale! We will review the pull request and get back to you soon.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates tool-related evaluators to allow conversations that include restricted built-in tools by disabling unsupported-tool checks in their input validators, and adds regression tests to ensure the relaxed behavior is limited to those evaluators.

Changes:

Set check_for_unsupported_tools=False for ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, and _ToolCallSuccessEvaluator validators.
Added unit tests covering acceptance of restricted tools for those evaluators and continued rejection when validator flags are enabled.
Bumped package version and documented the behavior change in the changelog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py	Disables unsupported-tool checking in `ToolCallsValidator` wiring.
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py	Disables unsupported-tool checking in `ToolDefinitionsValidator` wiring.
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/_tool_call_success.py	Disables unsupported-tool checking in `ToolDefinitionsValidator` wiring.
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_unsupported_tools_validation.py	Adds regression tests ensuring restricted tools are accepted only where intended.
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py	Bumps version to 1.17.1.
sdk/evaluation/azure-ai-evaluation/CHANGELOG.md	Documents the new behavior under 1.17.1 (Unreleased).

+        # Should not raise EvaluationException; flag flip made this path legal.
+        assert evaluator._validator.validate_eval_input(eval_input) is True


+@pytest.mark.usefixtures("mock_model_config")
+@pytest.mark.unittest
+class TestRestrictedToolValidationLifted:
+    """Validator should no longer reject restricted tools for these three evaluators."""


+def _restricted_tool_definition(tool_name: str):
+    return {
+        "name": tool_name,
+        "description": f"Built-in {tool_name} tool.",
+        "parameters": {
+            "type": "object",
+            "properties": {"query": {"type": "string"}},
+        },
+    }


+RESTRICTED_TOOL_NAMES = [
+    "bing_grounding",
+    "bing_custom_search",
+    "azure_ai_search",
+    "azure_fabric",
+    "sharepoint_grounding",
+]


When any tool_call or tool_result in the response carries a known-failure status (failed, error, incomplete, cancelled/canceled), short-circuit _do_eval to return a deterministic fail result (score=0, _passed=False, _result='fail') without invoking the LLM. The evaluator's scoring contract is explicitly binary -- 'FALSE: at least one tool call failed' -- and the prompty rubric does not consider the status field, so it would otherwise grade only the (typically empty) result body and frequently mis-score failed conversations as passes. Reuses the existing pre-flow short-circuit pattern (_is_intermediate_response / _return_not_applicable_result) for consistency. Status is only populated by upstream converters that preserve it; absent status, behavior is unchanged. Bumps to 1.17.1, adds CHANGELOG entry, and adds 19 focused unit tests.

mmkawale requested a review from a team as a code owner June 5, 2026 17:08

Copilot AI review requested due to automatic review settings June 5, 2026 17:08

github-actions Bot added Community Contribution Community members are working on the issue customer-reported Issues that are reported by GitHub users external to the Azure organization. Evaluation Issues related to the client library for Azure AI Evaluation labels Jun 5, 2026

Copilot AI reviewed Jun 5, 2026

View reviewed changes

mmkawale mentioned this pull request Jun 5, 2026

Enable ToolCallAccuracy/InputAccuracy/CallSuccess on restricted-tool conversations Azure/azureml-assets#5117

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Evaluation] Enable ToolCallAccuracy/Input/Success on restricted-tool conversations#47369

[Evaluation] Enable ToolCallAccuracy/Input/Success on restricted-tool conversations#47369
mmkawale wants to merge 2 commits into
Azure:mainfrom
mmkawale:mk/enable-tool-evals-1

mmkawale commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		# Should not raise EvaluationException; flag flip made this path legal.
		assert evaluator._validator.validate_eval_input(eval_input) is True

Conversation

mmkawale commented Jun 5, 2026

Summary

Implementation

Tests

Versioning

Related

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants