[Evaluation] Converter: bing_custom_search + sharepoint_grounding branches; query/input fallback for AIS/SP/Fabric #47396

Draft
mmkawale wants to merge 5 commits into Azure:main from mmkawale:mk/restricted-tool-converter-branches

Conversation

@mmkawale mmkawale commented Jun 8, 2026

Contributor

Restricted-tool converter: BCS + SharePoint branches, query/input argument fallback

What this PR does

Extends break_tool_call_into_messages in _converters/_models.py so the three status-only restricted-tool evaluators (ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, _ToolCallSuccessEvaluator) can score conversations involving the two restricted tools the converter currently drops on the floor, and so the AI Search / Fabric / SharePoint argument extraction stops surfacing empty arguments on live traces.

This is the converter half of the Phase 1 "restricted-tool enablement" work and is stacked on top of #47369 (which contains the validator flip, the Tool Call Success status short-circuit, and the _ToolInputAccuracyEvaluator top-level export). Both PRs land in 1.17.1. Until #47369 merges, this PR's diff page will show both PRs' changes; once #47369 is in, the diff cleans up.

Concrete changes

  1. bing_custom_search — new arguments-only branch mirroring bing_grounding. Emits a tool_call with requesturl; no tool_result (Bing-family results are redacted upstream for compliance, so there is nothing to dump). Before: BCS calls were silently dropped because the elif chain ended with else: return messages. After: TCA + TIA score on BCS conversations.
  2. sharepoint_grounding — new arguments + result branch mirroring azure_ai_search. Emits both a tool_call (with the search term) and a tool_result (with the output payload). Before: SP calls were silently dropped. After: TCA + TIA + TCS score on SP, and the tool_result is structurally where the Phase 2 Groundedness / Tool Output Utilization extractor will read from.
  3. query / input argument fallback on AIS, SP, Fabric — each branch now reads details["<tool>"].get("input") or details["<tool>"].get("query") or "" instead of dereferencing ["input"] directly. Live agent traces emit the search term under query (not input) for all three tools, so the previous code was surfacing empty arguments to the evaluators on production conversations. Behavior is unchanged for traces that emit input.
  4. Stale comment refresh — the top-of-function comment was dated "March 17th, 2025" and claimed only custom functions were supported. Replaced with a description of the current branch set.
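
The fallback in item 3 can be sketched as below. This is a minimal illustration, not the converter's actual code: the helper name `extract_search_term` and the exact shape of `details` are assumptions, and the sketch tolerates a missing tool entry, which the PR's expression does not necessarily do. Only the `.get("input") or .get("query") or ""` chain comes from the description above.

```python
from typing import Any, Dict


def extract_search_term(details: Dict[str, Any], tool_key: str) -> str:
    """Return the search term for a tool call, preferring "input" over "query".

    Hypothetical helper: `details` is assumed to be a dict keyed by tool name,
    e.g. {"azure_ai_search": {"query": "..."}}.
    """
    tool_details = details.get(tool_key) or {}
    # "input" is read first, so traces that already emit it behave as before.
    return tool_details.get("input") or tool_details.get("query") or ""


# A trace that emits "input" and a live trace that emits "query" both resolve.
legacy = extract_search_term({"azure_ai_search": {"input": "q1"}}, "azure_ai_search")
live = extract_search_term({"azure_ai_search": {"query": "q2"}}, "azure_ai_search")
```

Because the chain uses `or`, an empty-string `input` also falls through to `query`, which matches the intent of never surfacing empty arguments when a populated key exists.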

bing_grounding's output side is intentionally left as the existing return messages early-exit. Tool Call Success therefore continues to return NOT_APPLICABLE on Bing-only conversations (nothing to inspect). Lifting that requires a product decision about what status to assert on a redacted Bing turn — out of scope for this PR.
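
The difference between the arguments-only branches and the arguments + result branch can be sketched as follows. The message dicts here are simplified stand-ins for the real Message / ToolMessage / AssistantMessage classes, and the function name, the `tool_call_id` / `tool_result` / `output` field names, and the use of an `input` argument key for SharePoint are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Bing-family tools: results are redacted upstream, so only the call is emitted.
ARGUMENTS_ONLY_TOOLS = ("bing_grounding", "bing_custom_search")


def sketch_tool_messages(tool_type: str, details: Dict[str, Any], call_id: str) -> List[Dict[str, Any]]:
    """Return simplified message dicts for one tool call (illustrative shapes only)."""
    tool = details.get(tool_type) or {}
    if tool_type in ARGUMENTS_ONLY_TOOLS:
        call = {
            "type": "tool_call",
            "tool_call_id": call_id,
            "name": tool_type,
            "arguments": {"requesturl": tool.get("requesturl", "")},
        }
        return [call]  # early exit: there is no tool_result to emit
    if tool_type == "sharepoint_grounding":
        term = tool.get("input") or tool.get("query") or ""
        call = {
            "type": "tool_call",
            "tool_call_id": call_id,
            "name": tool_type,
            "arguments": {"input": term},
        }
        result = {
            "type": "tool_result",
            "tool_call_id": call_id,
            "tool_result": tool.get("output"),
        }
        return [call, result]
    return []  # tool types without a branch produce no messages
```

In this sketch a Bing-only conversation never yields a `tool_result`, which is why an evaluator that inspects results has nothing to score on those turns.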

Coverage after this PR (combined with #47369)

| Tool | Tool Call Accuracy | Tool Input Accuracy | Tool Call Success |
| --- | --- | --- | --- |
| azure_ai_search | ✅ | ✅ | ✅ |
| azure_fabric | ✅ | ✅ | ✅ |
| sharepoint_grounding (new SP branch) | ✅ | ✅ | ✅ |
| bing_grounding (existing branch, early-return on result) | ✅ | ✅ | ⚠️ NOT_APPLICABLE (no tool_result to inspect) |
| bing_custom_search (new BCS mirror, early-return on result) | ✅ | ✅ | ⚠️ NOT_APPLICABLE (same reason as BG) |

Tests

Added 5 new tests in tests/converters/ai_agent_converter/test_ai_agent_converter_internals.py:

  • test_bing_custom_search_tool_calls
  • test_sharepoint_grounding_tool_calls
  • test_sharepoint_grounding_tool_calls_query_key_fallback
  • test_azure_ai_search_tool_calls_query_key_fallback
  • test_fabric_dataagent_tool_calls_query_key_fallback

The new tests construct ToolCall via a small _HybridDict helper instead of going through ToolDecoder, so they don't depend on the agents-SDK RunStep* typed models that have moved between azure.ai.projects.models and azure.ai.agents.models packages. This is also why the new tests run cleanly even in local environments where the existing test_bing_grounding_tool_calls / test_file_search_tool_calls / etc. fail with NameError on the moved models (a pre-existing infra issue, untouched by this PR).
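
The PR's `_HybridDict` helper is not shown on this page. A minimal sketch of what such a helper might look like, assuming its purpose is to let one object be read both by key and by attribute, is given below; the class name `HybridDict` and the sample payload are hypothetical.

```python
class HybridDict(dict):
    """Dict whose keys can also be read as attributes (hypothetical stand-in)."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods
        # such as .get() and .items() keep working.
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc


# Build a tool-call-like payload without importing any SDK model classes.
call = HybridDict(
    id="call_1",
    type="sharepoint_grounding",
    sharepoint_grounding={"query": "quarterly report"},
)
```

One limitation of this approach is that keys sharing a name with a dict method (for example `items`) are reachable only by key, not by attribute.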

Backward compatibility

  • API surface: unchanged. break_tool_call_into_messages signature, Message / ToolMessage / AssistantMessage shapes are all the same.
  • Output for conversations the converter already handled: identical, with one narrow exception tied to the live bug fix (item 3 above). For AIS / SP / Fabric conversations:
    • If the runtime emits input, output is unchanged (the .get("input") or .get("query") expression reads "input" first).
    • If the runtime emits query, the converter previously produced empty arguments (so evaluators couldn't score them); it now produces the populated query value. This is the intentional fix.
  • No removed symbols, no renamed parameters, no changed defaults.

CHANGELOG

Two bullets added under 1.17.1 (Unreleased) > Features Added:

  • "Extended break_tool_call_into_messages ... with explicit branches for bing_custom_search and sharepoint_grounding ..."
  • "Made the per-tool argument extraction ... resilient to the query vs input runtime drift ..."

Related

manaskawale and others added 3 commits June 5, 2026 08:16
These three evaluators grade the agent's tool selection, input arguments,
and call status -- none consume the (redacted) tool output body -- so the
previous unconditional rejection of conversations containing built-in
restricted tools (bing_grounding, bing_custom_search, azure_ai_search,
azure_fabric, sharepoint_grounding) is now lifted.

Implementation:
- Set check_for_unsupported_tools=False on each evaluator's input validator
  in _tool_call_accuracy.py, _tool_input_accuracy.py, _tool_call_success.py.
- The underlying ToolDefinitionsValidator / ToolCallsValidator classes are
  unchanged; GroundednessEvaluator and ToolOutputUtilizationEvaluator still
  reject restricted tools because they require the tool output body.

Tests:
- New test_unsupported_tools_validation.py (26 tests) covers:
  * 15 parametrized cases: each of the 3 evaluators x 5 restricted tools,
    asserting validate_eval_input returns True for response= payloads.
  * 1 mixed-tools case.
  * 10 regression cases asserting the underlying validators still reject
    restricted tools when check_for_unsupported_tools=True.

Versioning:
- Bumped _version.py 1.17.0 -> 1.17.1.
- Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added.
When any tool_call or tool_result in the response carries a known-failure status (failed, error, incomplete, cancelled/canceled), short-circuit _do_eval to return a deterministic fail result (score=0, _passed=False, _result='fail') without invoking the LLM. The evaluator's scoring contract is explicitly binary -- 'FALSE: at least one tool call failed' -- and the prompty rubric does not consider the status field, so it would otherwise grade only the (typically empty) result body and frequently mis-score failed conversations as passes.

Reuses the existing pre-flow short-circuit pattern (_is_intermediate_response / _return_not_applicable_result) for consistency. Status is only populated by upstream converters that preserve it; absent status, behavior is unchanged. Bumps to 1.17.1, adds CHANGELOG entry, and adds 19 focused unit tests.
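
The status short-circuit described in this commit can be sketched roughly as below. The function name, the assumed response layout (a list of messages whose `content` is a list of dicts with `type` and `status`), and the plain result keys are illustrative assumptions; only the set of failure statuses and the score=0 / fail outcome come from the commit message.

```python
from typing import Any, Dict, List, Optional

# Known-failure statuses listed in the commit message.
FAILURE_STATUSES = {"failed", "error", "incomplete", "cancelled", "canceled"}


def short_circuit_on_failure(response: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return a deterministic fail result if any tool item carries a failure status.

    Returns None when no failure status is found, meaning the caller would
    continue to the normal LLM-based evaluation.
    """
    for message in response:
        content = message.get("content") or []
        if not isinstance(content, list):
            continue  # plain-string content carries no tool items
        for item in content:
            if not isinstance(item, dict):
                continue
            is_tool_item = item.get("type") in ("tool_call", "tool_result")
            if is_tool_item and item.get("status") in FAILURE_STATUSES:
                return {"score": 0, "passed": False, "result": "fail"}
    return None
```

Items with no `status` field never match, which mirrors the stated behavior that absent status leaves evaluation unchanged.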
… namespace

Brings _ToolInputAccuracyEvaluator in line with its three sibling tool evaluators (ToolCallAccuracyEvaluator, _ToolCallSuccessEvaluator, _ToolOutputUtilizationEvaluator) which are already exposed on the top-level package. Consumers (notably the Foundry evaluations service catalog) can now import it from azure.ai.evaluation directly instead of reaching into the private _evaluators._tool_input_accuracy submodule.
github-actions Bot added the Community Contribution, customer-reported, and Evaluation labels Jun 8, 2026
github-actions Bot commented Jun 8, 2026

Contributor

Thank you for your contribution @mmkawale! We will review the pull request and get back to you soon.

…ery/input fallback for AIS, SP, Fabric

break_tool_call_into_messages previously had no elif branch for bing_custom_search or sharepoint_grounding, so calls touching either tool were silently dropped before any evaluator could see them. The three status-only tool evaluators (ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, _ToolCallSuccessEvaluator) therefore returned NOT_APPLICABLE on those conversations even after the validator was loosened in PR Azure#47369.

Changes:

- bing_custom_search: arguments-only branch mirroring bing_grounding (emits a tool_call with the requesturl; no tool_result, since Bing-family results are redacted upstream for compliance).

- sharepoint_grounding: arguments + dumped output, mirroring azure_ai_search. Phase 2 will extend the Groundedness extractor to walk the documents structure already present on the tool_result.

- azure_ai_search, sharepoint_grounding, fabric_dataagent input branches: switched from a direct details[<tool>]["input"] dereference to .get("input") or .get("query") or "" (empty-string fallback). Live agent traces emit the search term under 'query' for all three, which made the existing AIS and Fabric branches surface empty arguments to evaluators (a live bug, not just a Phase 1 prerequisite).

- Refreshed the stale March-2025 top-of-function comment to reflect the current set of supported built-ins.

Tests:

Added 5 new tests in tests/converters/ai_agent_converter/test_ai_agent_converter_internals.py covering bing_custom_search, sharepoint_grounding (input key and output dump), and the query-key fallback for AIS, SP, and Fabric. The new tests construct ToolCall via a small _HybridDict helper instead of going through ToolDecoder, so they do not depend on the agents SDK RunStep* models that have moved between azure.ai.projects.models and azure.ai.agents.models packages.
@mmkawale mmkawale force-pushed the mk/restricted-tool-converter-branches branch from 0e93060 to 5445c22 Compare June 8, 2026 19:02