[ai-projects] Self-contain 4 trace-based eval samples #47322
Draft
aprilk-ms wants to merge 6 commits into
Replace FOUNDRY_CONVERSATION_IDS / FOUNDRY_TRACE_IDS prerequisites with an inline seed step: create a transient agent, seed 3 multi-turn conversations against it, then evaluate them by Foundry conversation ID. Retry the eval run if Application Insights ingestion is still in flight. Best-effort cleanup of the eval, seeded conversations, and agent. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
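The retry described above (re-run the eval while Application Insights ingestion is still in flight) can be sketched as a generic retry-on-empty helper. This is a minimal illustration, not the sample's actual code: `retry_until_nonempty`, its parameters, and the idea of passing the eval run as a `fetch` callable are all assumptions made for the example.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry_until_nonempty(
    fetch: Callable[[], Sequence[T]],
    attempts: int = 5,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Sequence[T]:
    """Call fetch() until it returns a non-empty sequence or attempts run out.

    `fetch` is a hypothetical stand-in for "run the eval and return its
    output items"; an empty result is treated as "traces not ingested yet".
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    result: Sequence[T] = []
    for attempt in range(1, attempts + 1):
        result = fetch()
        if result:
            return result
        if attempt < attempts:
            # Give ingestion more time before the next attempt.
            sleep(delay_seconds)
    # Still empty after the final attempt; the caller decides how to report it.
    return result
```

Injecting `sleep` keeps the helper testable without real waiting; the delay and attempt count here are placeholders, not values taken from the samples.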
…lter
Rewrite the multi-turn agent_filter trace evaluation sample to be self-contained:
create a transient agent, seed 3 multi-turn conversations, wait for trace
ingestion, then evaluate using agent_filter narrowed to the seeded agent.
Key service constraints baked in as constants:
- agent_filter requires end_time - start_time >= 15 min
- conversation-level queries exclude conversations whose first/last span
is within 5 min of either window edge
No external state required (no FOUNDRY_AGENT_NAME env var, no pre-existing
trace data). Verified end-to-end against bugbash-westus2/gpt-4.1: 3/3 passed
on first attempt.
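The two window constraints listed in this commit message can be combined into one small calculation. The sketch below is an assumption about how a query window might be derived from the seeded spans; the constant names and values match those listed in the PR description, but `compute_trace_window` is a hypothetical helper and the sample's real logic may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Values as listed in the PR description: one minute of headroom over the
# 15 min minimum window and the 5 min edge exclusion.
MIN_AGENT_FILTER_WINDOW_SECONDS = 16 * 60
AGENT_FILTER_EDGE_BUFFER_SECONDS = 6 * 60


def compute_trace_window(first_span: datetime, last_span: datetime) -> Tuple[datetime, datetime]:
    """Return (start_time, end_time) that satisfies both constraints."""
    buffer = timedelta(seconds=AGENT_FILTER_EDGE_BUFFER_SECONDS)
    start = first_span - buffer
    end = last_span + buffer
    shortfall = timedelta(seconds=MIN_AGENT_FILTER_WINDOW_SECONDS) - (end - start)
    if shortfall > timedelta(0):
        # Widen backwards so end_time does not move further into the future.
        start -= shortfall
    return start, end


if __name__ == "__main__":
    seeded_at = datetime.now(timezone.utc)
    start, end = compute_trace_window(seeded_at, seeded_at + timedelta(minutes=2))
    print(start.isoformat(), end.isoformat())
```

Note that `end_time` lands after the last span, so a caller following this sketch would still need to wait until that time has passed (and ingestion has caught up) before querying.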
Rewrite the single-turn smart_filter trace evaluation sample to be
self-contained: create a transient agent, seed 5 single-turn prompts,
wait for trace ingestion, then evaluate with agent_filter +
filter_strategy='smart_filtering'.
Key service constraints baked in as constants:
- agent_filter requires end_time - start_time >= 15 min
- queries exclude traces whose first/last span is within 5 min of either
window edge
- smart_filtering requires max_traces in [15, 1000] (sample auto-bumps
--max-traces if needed)
No external state required (no FOUNDRY_AGENT_NAME env var, no pre-existing
trace data). Verified end-to-end against bugbash-westus2/gpt-4.1: 5/5 passed
on first attempt.
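The max_traces auto-bump mentioned above amounts to clamping the requested value into the range smart_filtering accepts. A minimal sketch follows; `effective_max_traces` and the upper-bound constant name are hypothetical, and clamping at the upper end is an assumption (the commit message only describes bumping values that are too low).

```python
SMART_FILTERING_MIN_MAX_TRACES = 15
# Hypothetical name for the upper bound of the [15, 1000] range.
SMART_FILTERING_MAX_MAX_TRACES = 1000


def effective_max_traces(requested: int, filter_strategy: str) -> int:
    """Return a max_traces value the service should accept for the strategy."""
    if filter_strategy != "smart_filtering":
        # The range restriction is only documented for smart_filtering.
        return requested
    return max(
        SMART_FILTERING_MIN_MAX_TRACES,
        min(requested, SMART_FILTERING_MAX_MAX_TRACES),
    )
```

With this helper, a user passing `--max-traces 5` would end up sending 15, which is why only 5 seeded prompts can still be evaluated under smart_filtering.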
Rewrite the builtin-evaluators-with-traces sample to be self-contained:
seed a transient agent + conversations, wait for App Insights ingestion,
then evaluate via the azure_ai_traces data source (agent_id-resolved
trace lookup) with retry-on-empty, and clean up.
Fixes data_mapping to use {{item.*}} so the evaluators receive query,
response, and tool_definitions from each datasource item (matches the
flat shape produced by azure_ai_traces).
Live-tested against bugbash-westus2 with gpt-4.1 (5/5 passed on
attempt 1).
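The data_mapping fix can be illustrated with a small builder that produces `{{item.*}}` templates for the three fields named above. This is only a sketch: `build_data_mapping` is a hypothetical helper, and how the resulting dict is attached to each evaluator is not shown here because that shape is not spelled out in this PR text.

```python
from typing import Dict, Iterable


def build_data_mapping(
    fields: Iterable[str] = ("query", "response", "tool_definitions"),
) -> Dict[str, str]:
    """Map each evaluator input to the matching field on the datasource item.

    azure_ai_traces produces flat items, so the templates use {{item.<field>}}
    rather than {{sample.<field>}}.
    """
    return {field: "{{item." + field + "}}" for field in fields}


if __name__ == "__main__":
    print(build_data_mapping())
```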
The sample no longer requires APPINSIGHTS_RESOURCE_ID after being made self-contained. It still cannot be played back because it seeds traces and waits for real App Insights ingestion.
Summary
Rewrites four samples/evaluations/trace samples to be fully self-contained, mirroring the proven pattern from the merged sample_dataset_generation_job_traces_for_evaluation.py (PR #47250):
- sample_multiturn_trace_evaluation_by_id.py
- sample_multiturn_trace_evaluation_agent_filter.py
- sample_agent_trace_evaluation_smart_filter.py
- sample_evaluations_builtin_with_traces.py

Each sample now creates a transient agent, seeds the conversations it needs (single- or multi-turn), waits for App Insights ingestion, runs the evaluation with retry-on-empty, and unconditionally cleans up the agent + conversations + eval object in finally.

Environment requirements
- Before: FOUNDRY_PROJECT_ENDPOINT, FOUNDRY_MODEL_NAME, plus one of FOUNDRY_AGENT_NAME, APPINSIGHTS_RESOURCE_ID, or AGENT_ID depending on the sample, plus a pre-existing seeded agent in the project.
- After: only FOUNDRY_PROJECT_ENDPOINT and FOUNDRY_MODEL_NAME. No external setup, no leftovers.
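With only two variables left, a sample can validate its environment up front and fail with a clear message. The sketch below assumes the two variable names above; `load_sample_settings` and `REQUIRED_ENV_VARS` are hypothetical names, not code from the samples.

```python
import os
from typing import Dict, Mapping

REQUIRED_ENV_VARS = ("FOUNDRY_PROJECT_ENDPOINT", "FOUNDRY_MODEL_NAME")


def load_sample_settings(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise naming every missing variable."""
    missing = [name for name in REQUIRED_ENV_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in REQUIRED_ENV_VARS}
```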
Service constraints captured as named constants
While bringing each sample up I hit (and documented) the current
trace-eval service constraints:
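| Constraint | Captured as | Sample(s) |
| --- | --- | --- |
| agent_filter requires end_time - start_time >= 15 min | MIN_AGENT_FILTER_WINDOW_SECONDS = 16 * 60 | agent_filter, smart_filter |
| queries exclude traces/conversations whose first/last span is within 5 min of either window edge | AGENT_FILTER_EDGE_BUFFER_SECONDS = 6 * 60 | agent_filter |
| filter_strategy="smart_filtering" requires max_traces in [15, 1000] | SMART_FILTERING_MIN_MAX_TRACES = 15 | smart_filter (auto-bumps) |
| azure_ai_traces evaluators read from {{item.*}} (not {{sample.*}}) | _build_evaluator | builtin_with_traces |

Verification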
Each sample was live-run against bugbash-westus2 with gpt-4.1 before its commit:
- multiturn_trace_evaluation_by_id
- multiturn_trace_evaluation_agent_filter
- agent_trace_evaluation_smart_filter (… max_traces auto-bump)
- evaluations_builtin_with_traces (… item.* mapping fix)

Notes for reviewers
- … try/except) so a partial failure mid-run still tidies up.
- … runtime modest while still exercising multi-conversation behavior.
- lookback_hours=1 / max_traces=5 defaults on builtin_with_traces keep the trace window small even when the project has unrelated traffic.
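The cleanup behavior noted for reviewers (each step guarded so a partial failure mid-run still tidies up) can be sketched as follows. This is an illustrative pattern under stated assumptions: `best_effort_cleanup` and the step labels are hypothetical, and the real samples call SDK delete operations that are not reproduced here.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def best_effort_cleanup(steps: Iterable[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run every cleanup step, logging failures instead of raising.

    Returns the labels of steps that failed, so a caller can report leftovers.
    """
    failed: List[str] = []
    for label, action in steps:
        try:
            action()
        except Exception as exc:  # deliberately broad: cleanup must not mask the real error
            logger.warning("Cleanup of %s failed: %s", label, exc)
            failed.append(label)
    return failed
```

A sample following this pattern would call the helper from a `finally` block, passing one step each for the eval object, the seeded conversations, and the agent, so that a failure deleting one resource does not prevent the others from being removed.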
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>