[ai-projects] Self-contain 4 trace-based eval samples by aprilk-ms · Pull Request #47322 · Azure/azure-sdk-for-python

aprilk-ms · 2026-06-03T18:40:53Z

[ai-projects] Self-contain 4 trace-based eval samples

Summary

Rewrites four samples/evaluations/ trace samples to be fully
self-contained, mirroring the proven pattern from the merged
sample_dataset_generation_job_traces_for_evaluation.py (PR #47250):

sample_multiturn_trace_evaluation_by_id.py
sample_multiturn_trace_evaluation_agent_filter.py
sample_agent_trace_evaluation_smart_filter.py
sample_evaluations_builtin_with_traces.py

Each sample now creates a transient agent, seeds the conversations it
needs (single- or multi-turn), waits for App Insights ingestion, runs
the evaluation with retry-on-empty, and unconditionally cleans up the
agent + conversations + eval object in finally.
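
The retry-on-empty step can be sketched as follows. This is a minimal illustration, not the code in the samples: `run_with_retry_on_empty`, its parameters, and the idea that an evaluation attempt returns a possibly empty sequence of results are all assumptions made for the example.

```python
import time
from typing import Callable, Sequence


def run_with_retry_on_empty(
    run_eval: Callable[[], Sequence],
    max_attempts: int = 3,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Sequence:
    """Call run_eval until it returns at least one result or attempts run out.

    run_eval is a hypothetical callable standing in for "create an eval run
    and collect its output items"; an empty result is treated as a sign that
    trace ingestion is still in flight.
    """
    results: Sequence = []
    for attempt in range(1, max_attempts + 1):
        results = run_eval()
        if results:
            return results
        if attempt < max_attempts:
            # Nothing came back yet; wait before the next attempt.
            sleep(delay_seconds)
    # Every attempt was empty; hand back the last (empty) result.
    return results
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.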

Environment requirements

Before: FOUNDRY_PROJECT_ENDPOINT, FOUNDRY_MODEL_NAME, plus
one of FOUNDRY_AGENT_NAME, APPINSIGHTS_RESOURCE_ID, or
AGENT_ID depending on the sample, plus a pre-existing seeded agent
in the project.

After: only FOUNDRY_PROJECT_ENDPOINT and FOUNDRY_MODEL_NAME.
No external setup, no leftovers.
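
A sample with only these two requirements might validate them up front along these lines. The helper name `load_settings` and the error message are hypothetical; only the two variable names come from this PR.

```python
import os
from typing import Dict, Mapping

# The two variables named in this PR description.
REQUIRED_ENV_VARS = ("FOUNDRY_PROJECT_ENDPOINT", "FOUNDRY_MODEL_NAME")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, failing fast if any is missing or empty."""
    missing = [name for name in REQUIRED_ENV_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_ENV_VARS}
```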

Service constraints captured as named constants

While bringing each sample up I hit (and documented) the current
trace-eval service constraints:

Constraint	Constant	Where
`agent_filter` requires `end_time - start_time >= 15 min`	`MIN_AGENT_FILTER_WINDOW_SECONDS = 16 * 60`	`agent_filter`, `smart_filter`
Conversation-level queries skip conversations whose first/last span is within ~5 min of the window edge	`AGENT_FILTER_EDGE_BUFFER_SECONDS = 6 * 60`	`agent_filter`
`filter_strategy="smart_filtering"` requires `max_traces` in `[15, 1000]`	`SMART_FILTERING_MIN_MAX_TRACES = 15`	`smart_filter` (auto-bumps)
`azure_ai_traces` evaluators read from `{{item.*}}` (not `{{sample.*}}`)	n/a (fixed in `_build_evaluator`)	`builtin_with_traces`
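
One way the two window constants could combine into a query window is sketched below. The constant names and values are taken from the table; the function `build_query_window` and the choice to widen the window by moving the start earlier are assumptions for illustration, not necessarily what the samples do.

```python
from datetime import datetime, timedelta
from typing import Tuple

MIN_AGENT_FILTER_WINDOW_SECONDS = 16 * 60   # service minimum is 15 min
AGENT_FILTER_EDGE_BUFFER_SECONDS = 6 * 60   # service skips spans within ~5 min of an edge


def build_query_window(first_span: datetime, last_span: datetime) -> Tuple[datetime, datetime]:
    """Return (start_time, end_time) that satisfies both constraints.

    first_span / last_span are the timestamps of the earliest and latest
    seeded spans (hypothetical inputs for this sketch).
    """
    buffer = timedelta(seconds=AGENT_FILTER_EDGE_BUFFER_SECONDS)
    start = first_span - buffer
    end = last_span + buffer
    # If the buffered window is still too short, push the start further back.
    shortfall = MIN_AGENT_FILTER_WINDOW_SECONDS - (end - start).total_seconds()
    if shortfall > 0:
        start -= timedelta(seconds=shortfall)
    return start, end
```

Widening only the start keeps `end` as close to the present as the buffer allows, which matters if the service rejects or ignores future timestamps (an assumption not confirmed here).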

Verification

Each sample was live-run against bugbash-westus2 with gpt-4.1
before its commit:

Sample	Attempt	Result
`multiturn_trace_evaluation_by_id`	1	3/3 passed
`multiturn_trace_evaluation_agent_filter`	1 (after the two constraint fixes above)	3/3 passed
`agent_trace_evaluation_smart_filter`	1 (after `max_traces` auto-bump)	5/5 passed
`evaluations_builtin_with_traces`	1 (after `item.*` mapping fix)	5/5 passed

Notes for reviewers

Cleanup is best-effort (try/except) so a partial failure mid-run
still tidies up.
The samples deliberately seed only a few conversations (3–5), which
keeps runtime modest while still exercising multi-conversation behavior.
The lookback_hours=1 / max_traces=5 defaults on
builtin_with_traces keep the trace window small even when the
project has unrelated traffic.
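
The best-effort cleanup described in the first note can be sketched as below. The helper `cleanup_best_effort` and the labelled-callable shape are hypothetical; the real samples call their own delete operations inside `finally`.

```python
from typing import Callable, Iterable, List, Tuple


def cleanup_best_effort(steps: Iterable[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run every cleanup step, even if earlier ones fail.

    Each step is a (label, callable) pair. Returns the labels of the steps
    that raised, so the caller can report leftovers.
    """
    failures: List[str] = []
    for label, step in steps:
        try:
            step()
        except Exception as exc:  # deliberately broad: cleanup must not abort
            failures.append(label)
            print(f"cleanup of {label} failed: {exc}")
    return failures
```

Called from a `finally` block, a failure to delete the eval object would then not prevent the conversations and agent from being deleted.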

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace FOUNDRY_CONVERSATION_IDS / FOUNDRY_TRACE_IDS prerequisites with an inline seed step: create a transient agent, seed 3 multi-turn conversations against it, then evaluate them by Foundry conversation ID. Retry the eval run if Application Insights ingestion is still in flight. Best-effort cleanup of the eval, seeded conversations, and agent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…lter

Rewrite the multi-turn agent_filter trace evaluation sample to be self-contained: create a transient agent, seed 3 multi-turn conversations, wait for trace ingestion, then evaluate using agent_filter narrowed to the seeded agent.

Key service constraints baked in as constants:
- agent_filter requires end_time - start_time >= 15 min
- conversation-level queries exclude conversations whose first/last span is within 5 min of either window edge

No external state required (no FOUNDRY_AGENT_NAME env var, no pre-existing trace data).

Verified end-to-end against bugbash-westus2/gpt-4.1: 3/3 passed on first attempt.

Rewrite the single-turn smart_filter trace evaluation sample to be self-contained: create a transient agent, seed 5 single-turn prompts, wait for trace ingestion, then evaluate with agent_filter + filter_strategy='smart_filtering'.

Key service constraints baked in as constants:
- agent_filter requires end_time - start_time >= 15 min
- queries exclude traces whose first/last span is within 5 min of either window edge
- smart_filtering requires max_traces in [15, 1000] (sample auto-bumps --max-traces if needed)

No external state required (no FOUNDRY_AGENT_NAME env var, no pre-existing trace data).

Verified end-to-end against bugbash-westus2/gpt-4.1: 5/5 passed on first attempt.
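
The `max_traces` auto-bump can be illustrated with a small clamp. `SMART_FILTERING_MIN_MAX_TRACES` is the constant named in this PR; the upper-bound constant name and the decision to clamp (rather than reject) values above 1000 are assumptions made for this sketch.

```python
SMART_FILTERING_MIN_MAX_TRACES = 15
# Hypothetical name; the PR only states the allowed range is [15, 1000].
SMART_FILTERING_MAX_MAX_TRACES = 1000


def effective_max_traces(requested: int) -> int:
    """Bump or cap the requested value so it falls inside [15, 1000]."""
    return max(
        SMART_FILTERING_MIN_MAX_TRACES,
        min(requested, SMART_FILTERING_MAX_MAX_TRACES),
    )
```

With this helper, a request for 5 traces is raised to 15, while a request for 200 passes through unchanged.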

Rewrite the builtin-evaluators-with-traces sample to be self-contained: seed a transient agent + conversations, wait for App Insights ingestion, then evaluate via the azure_ai_traces data source (agent_id-resolved trace lookup) with retry-on-empty, and clean up. Fixes data_mapping to use {{item.*}} so the evaluators receive query, response, and tool_definitions from each datasource item (matches the flat shape produced by azure_ai_traces). Live-tested against bugbash-westus2 with gpt-4.1 (5/5 passed on attempt 1).
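
The shape of the corrected mapping might look like the following. The three field names come from the commit message above; the helper `build_item_mapping` is hypothetical, and how the resulting dictionary is attached to each evaluator is not shown in this PR description.

```python
from typing import Dict, Iterable


def build_item_mapping(fields: Iterable[str]) -> Dict[str, str]:
    """Map each evaluator input to the matching {{item.<field>}} template."""
    return {name: "{{item." + name + "}}" for name in fields}


# Fields the evaluators receive from each azure_ai_traces datasource item.
data_mapping = build_item_mapping(["query", "response", "tool_definitions"])
```

Building the templates from a field list makes it harder to reintroduce a stray `{{sample.*}}` reference by hand.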

aprilk-ms and others added 5 commits June 3, 2026 11:14

[ai-projects] CHANGELOG: note self-contained trace eval samples

9c5886a

github-actions Bot added the AI Projects label Jun 3, 2026

[ai-projects] Update stale skip reason for builtin_with_traces sample

417b927

The sample no longer requires APPINSIGHTS_RESOURCE_ID after being made self-contained. It still cannot be played back because it seeds traces and waits for real App Insights ingestion.

aprilk-ms wants to merge 6 commits into main from users/aprilk/sample-self-contained-trace-evals
