[ai-projects] Self-contain 4 trace-based eval samples #47322

Draft
aprilk-ms wants to merge 6 commits into main from users/aprilk/sample-self-contained-trace-evals

Conversation

@aprilk-ms
Member

[ai-projects] Self-contain 4 trace-based eval samples

Summary

Rewrites four samples/evaluations/ trace samples to be fully self-contained,
mirroring the proven pattern from the merged
sample_dataset_generation_job_traces_for_evaluation.py (PR #47250):

  1. sample_multiturn_trace_evaluation_by_id.py
  2. sample_multiturn_trace_evaluation_agent_filter.py
  3. sample_agent_trace_evaluation_smart_filter.py
  4. sample_evaluations_builtin_with_traces.py

Each sample now creates a transient agent, seeds the conversations it
needs (single- or multi-turn), waits for App Insights ingestion, runs
the evaluation with retry-on-empty, and unconditionally cleans up the
agent + conversations + eval object in finally.
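
The retry-on-empty step could look roughly like the sketch below. It is illustrative only and is not the code in the samples: `run_with_retry_on_empty`, `max_attempts`, and `wait_seconds` are hypothetical names, and `run_eval` stands in for whatever call creates the eval run and returns its result rows. The assumption is that an empty result means App Insights has not finished ingesting the seeded traces yet.

```python
import time
from typing import Callable, Sequence


def run_with_retry_on_empty(
    run_eval: Callable[[], Sequence[dict]],
    max_attempts: int = 5,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Sequence[dict]:
    """Call run_eval until it returns at least one result row.

    An empty result is treated as "traces not ingested yet" (an assumption),
    so the function waits and tries again instead of failing immediately.
    """
    for attempt in range(1, max_attempts + 1):
        results = run_eval()
        if results:
            return results
        if attempt < max_attempts:
            # Give ingestion more time before the next attempt.
            sleep(wait_seconds)
    raise TimeoutError(f"evaluation returned no results after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the helper testable without real waits.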

Environment requirements

Before: FOUNDRY_PROJECT_ENDPOINT, FOUNDRY_MODEL_NAME, plus
one of
FOUNDRY_AGENT_NAME, APPINSIGHTS_RESOURCE_ID, or
AGENT_ID depending on the sample, plus a pre-existing seeded agent
in the project.

After: only FOUNDRY_PROJECT_ENDPOINT and FOUNDRY_MODEL_NAME.
No external setup, no leftovers.

Service constraints captured as named constants

While bringing each sample up I hit (and documented) the current
trace-eval service constraints:

| Constraint | Constant | Where |
| --- | --- | --- |
| agent_filter requires end_time - start_time >= 15 min | MIN_AGENT_FILTER_WINDOW_SECONDS = 16 * 60 | agent_filter, smart_filter |
| Conversation-level queries skip conversations whose first/last span is within ~5 min of the window edge | AGENT_FILTER_EDGE_BUFFER_SECONDS = 6 * 60 | agent_filter |
| filter_strategy="smart_filtering" requires max_traces in [15, 1000] | SMART_FILTERING_MIN_MAX_TRACES = 15 | smart_filter (auto-bumps) |
| azure_ai_traces evaluators read from {{item.*}} (not {{sample.*}}) | n/a (fixed in _build_evaluator) | builtin_with_traces |
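
As a rough illustration of how the window and max_traces constraints combine, here is a minimal sketch. The three constant names and values are the ones listed above; `build_window`, `clamp_max_traces`, and `SMART_FILTERING_MAX_MAX_TRACES` are hypothetical names, and clamping the upper bound (rather than rejecting it) is my assumption, not confirmed sample behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

MIN_AGENT_FILTER_WINDOW_SECONDS = 16 * 60   # service minimum is 15 min; 1 min of slack
AGENT_FILTER_EDGE_BUFFER_SECONDS = 6 * 60   # service edge exclusion is ~5 min; 1 min of slack
SMART_FILTERING_MIN_MAX_TRACES = 15
SMART_FILTERING_MAX_MAX_TRACES = 1000       # hypothetical name; bound taken from [15, 1000]


def build_window(first_span: datetime, last_span: datetime) -> Tuple[datetime, datetime]:
    """Return (start_time, end_time) padding the seeded spans and meeting the minimum width."""
    buffer = timedelta(seconds=AGENT_FILTER_EDGE_BUFFER_SECONDS)
    start = first_span - buffer
    end = last_span + buffer
    shortfall = timedelta(seconds=MIN_AGENT_FILTER_WINDOW_SECONDS) - (end - start)
    if shortfall > timedelta(0):
        # Widen backwards only, so end_time is not pushed any later (a design assumption).
        start -= shortfall
    return start, end


def clamp_max_traces(requested: int) -> int:
    """Bump max_traces into the range smart_filtering accepts."""
    return max(SMART_FILTERING_MIN_MAX_TRACES, min(requested, SMART_FILTERING_MAX_MAX_TRACES))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(build_window(now - timedelta(minutes=2), now))
    print(clamp_max_traces(5))
```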

Verification

Each sample was live-run against bugbash-westus2 with gpt-4.1
before its commit:

| Sample | Attempt | Result |
| --- | --- | --- |
| multiturn_trace_evaluation_by_id | 1 | 3/3 passed |
| multiturn_trace_evaluation_agent_filter | 1 (after the two constraint fixes above) | 3/3 passed |
| agent_trace_evaluation_smart_filter | 1 (after max_traces auto-bump) | 5/5 passed |
| evaluations_builtin_with_traces | 1 (after item.* mapping fix) | 5/5 passed |

Notes for reviewers

  • Cleanup is best-effort (try/except) so a partial failure mid-run
    still tidies up.
  • The samples deliberately seed only a few conversations (3–5), which
    keeps runtime modest while still exercising multi-conversation behavior.
  • The lookback_hours=1 / max_traces=5 defaults on
    builtin_with_traces keep the trace window small even when the
    project has unrelated traffic.
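
For reviewers who want the cleanup idea in isolation, a minimal sketch of best-effort cleanup follows. `best_effort_cleanup` and the step labels are hypothetical; the samples' actual delete calls are not reproduced here. Each step is attempted even if an earlier one fails, and failures are collected rather than raised.

```python
from typing import Callable, Iterable, List, Tuple


def best_effort_cleanup(steps: Iterable[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run every cleanup step; collect failures instead of raising."""
    failures: List[str] = []
    for label, action in steps:
        try:
            action()
        except Exception as exc:
            # Broad on purpose: a failed delete must not stop the remaining
            # deletes or mask the error that ended the run.
            failures.append(f"{label}: {exc}")
    return failures
```

Called from a `finally` block with one step each for the eval, the seeded conversations, and the agent, this tidies up whatever was created before a mid-run failure.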

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

aprilk-ms and others added 5 commits June 3, 2026 11:14
Replace FOUNDRY_CONVERSATION_IDS / FOUNDRY_TRACE_IDS prerequisites with an inline seed step: create a transient agent, seed 3 multi-turn conversations against it, then evaluate them by Foundry conversation ID. Retry the eval run if Application Insights ingestion is still in flight. Best-effort cleanup of the eval, seeded conversations, and agent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…lter

Rewrite the multi-turn agent_filter trace evaluation sample to be self-contained:
create a transient agent, seed 3 multi-turn conversations, wait for trace
ingestion, then evaluate using agent_filter narrowed to the seeded agent.

Key service constraints baked in as constants:
  - agent_filter requires end_time - start_time >= 15 min
  - conversation-level queries exclude conversations whose first/last span
    is within 5 min of either window edge

No external state required (no FOUNDRY_AGENT_NAME env var, no pre-existing
trace data). Verified end-to-end against bugbash-westus2/gpt-4.1: 3/3 passed
on first attempt.

Rewrite the single-turn smart_filter trace evaluation sample to be
self-contained: create a transient agent, seed 5 single-turn prompts,
wait for trace ingestion, then evaluate with agent_filter +
filter_strategy='smart_filtering'.

Key service constraints baked in as constants:
  - agent_filter requires end_time - start_time >= 15 min
  - queries exclude traces whose first/last span is within 5 min of either
    window edge
  - smart_filtering requires max_traces in [15, 1000] (sample auto-bumps
    --max-traces if needed)

No external state required (no FOUNDRY_AGENT_NAME env var, no pre-existing
trace data). Verified end-to-end against bugbash-westus2/gpt-4.1: 5/5 passed
on first attempt.

Rewrite the builtin-evaluators-with-traces sample to be self-contained:
seed a transient agent + conversations, wait for App Insights ingestion,
then evaluate via the azure_ai_traces data source (agent_id-resolved
trace lookup) with retry-on-empty, and clean up.

Fixes data_mapping to use {{item.*}} so the evaluators receive query,
response, and tool_definitions from each datasource item (matches the
flat shape produced by azure_ai_traces).
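
A sketch of what such a mapping might look like, assuming the evaluator inputs share their names with the item fields. The field names query, response, and tool_definitions come from this commit message; `build_data_mapping` and the dict shape are illustrative assumptions and not the contents of _build_evaluator.

```python
ITEM_FIELDS = ("query", "response", "tool_definitions")


def build_data_mapping(fields=ITEM_FIELDS, root="item"):
    """Map each evaluator input to the same-named field on the datasource item."""
    return {name: "{{" + root + "." + name + "}}" for name in fields}


if __name__ == "__main__":
    # {'query': '{{item.query}}', 'response': '{{item.response}}',
    #  'tool_definitions': '{{item.tool_definitions}}'}
    print(build_data_mapping())
```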

Live-tested against bugbash-westus2 with gpt-4.1 (5/5 passed on
attempt 1).

The sample no longer requires APPINSIGHTS_RESOURCE_ID after being made
self-contained. It still cannot be played back because it seeds traces
and waits for real App Insights ingestion.