Problem
Two related issues with how AgentV records and consumes provider output:
1. Provider-specific tool naming leaks into evaluators
The skill-trigger evaluator (packages/core/src/evaluation/evaluators/skill-trigger.ts) carries ~73 lines of provider-specific tool-name matching logic because each provider represents the same tool differently:
| Provider | Skill invocation | File read | Skill input field | Extra patterns |
| --- | --- | --- | --- | --- |
| Claude | tool: "Skill" | tool: "Read" | input.skill | none |
| Copilot | tool: "Skill" or "skill" | tool: "Read File", "readFile", "Read", "readTextFile" | input.skill | "Using skill: X" encoded in tool name |
| Codex | none | tool: "command_execution" | input.command | "mcp:<server>/<skill>" encoded in tool name |
| Pi | none | tool: "read" | input.path | none |
Every new provider requires a new matcher. The same problem will affect any future evaluator that inspects tool calls (tool-trajectory, cost-per-tool, etc.).
Root cause: Providers preserve native tool naming in ToolCall.tool. Normalization burden falls on every downstream consumer instead of being handled once at the provider boundary.
2. Raw event stream logs are the default output, not transcripts
Providers write raw per-event stream logs (one line per agent_message_chunk, tool_call, etc.) as the primary record of what happened. The normalized transcript format (TranscriptJsonLine), which consolidates messages, pairs tool calls, and includes token usage, is only available via a separate agentv import step. This means:
- Using --transcript requires a manual import step after every eval
Current naming is misleading: log_format: json sounds like "output as JSON" but actually means "raw event stream." log_format: summary sounds like a reduced version but is actually the more useful consolidated format. Neither produces a proper transcript.
Prior art: code-insights normalizes all provider sessions into a unified schema at the import/sync layer. Downstream consumers never see provider-specific naming or raw events.
Solution
Phase 1: Normalize ToolCall at the provider/import layer
Create a shared normalization function that maps provider-native tool names and input fields to canonical values. Apply it at the point where each provider constructs ToolCall objects.
Canonical tool names (use Claude's naming as the canonical set since it's already the cleanest):
| Canonical tool | Provider-native variants |
| --- | --- |
| "Skill" | "skill", "Using skill: X" (Copilot prefix), "mcp:<server>/<name>" (Codex) |
| "Read" | "Read File", "readFile", "readTextFile", "read", "Viewing X" (Copilot prefix) |
| "Write" | "writeTextFile", "Write File" |
| "Edit" | "editFile", "Edit File" |
| "Bash" | "command_execution" (Codex, when not a read), "runTerminalCommand" |
For tool names not in this table, pass through unchanged — normalization only covers the tools that evaluators need to reason about.
Canonical input fields — normalize alongside the tool name:
- Skill: ensure input.skill contains the skill name (extract from tool name prefix if needed)
- Read: ensure input.file_path contains the path (copy from input.path if that's what the provider uses)
Implementation:
Add normalizeToolCall(providerKind: ProviderKind, tc: ToolCall): ToolCall in a new file packages/core/src/evaluation/providers/normalize-tool-call.ts. This is a pure function — provider kind in, canonical ToolCall out. Use a static mapping table (not if/else chains) so adding a new provider means adding entries, not logic.
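A minimal sketch of what such a table-driven function could look like is shown here. The ToolCall and ProviderKind shapes, the provider kind strings, and the table names TOOL_ALIASES and PREFIX_RULES are assumptions made for illustration, not AgentV's actual definitions. Codex handling is left out because the proposal does not say how a Codex read is told apart from a shell command.

```typescript
// Hypothetical minimal types; the real ones live in AgentV and may differ.
type ProviderKind = 'claude' | 'copilot' | 'codex' | 'pi';

interface ToolCall {
  tool: string;
  input?: unknown;
}

// Exact-match aliases: provider-native tool name -> canonical tool name.
// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const TOOL_ALIASES: ReadonlyMap<string, string> = new Map([
  ['skill', 'Skill'],
  ['Read File', 'Read'],
  ['readFile', 'Read'],
  ['readTextFile', 'Read'],
  ['read', 'Read'],
  ['writeTextFile', 'Write'],
  ['Write File', 'Write'],
  ['editFile', 'Edit'],
  ['Edit File', 'Edit'],
  ['runTerminalCommand', 'Bash'],
]);

// Prefix rules: the tool name itself encodes the argument (Copilot).
interface PrefixRule {
  provider: ProviderKind;
  prefix: string;
  tool: string;
  inputField: string;
}

const PREFIX_RULES: readonly PrefixRule[] = [
  { provider: 'copilot', prefix: 'Using skill: ', tool: 'Skill', inputField: 'skill' },
  { provider: 'copilot', prefix: 'Viewing ', tool: 'Read', inputField: 'file_path' },
];

function toRecord(value: unknown): Record<string, unknown> {
  return typeof value === 'object' && value !== null
    ? (value as Record<string, unknown>)
    : {};
}

function normalizeToolCall(providerKind: ProviderKind, tc: ToolCall): ToolCall {
  const rule = PREFIX_RULES.find(
    (r) => r.provider === providerKind && tc.tool.startsWith(r.prefix),
  );
  const tool = rule?.tool ?? TOOL_ALIASES.get(tc.tool) ?? tc.tool;

  const input = toRecord(tc.input);
  const extra: Record<string, unknown> = {};

  // Read: copy input.path to input.file_path when the provider uses `path`.
  if (tool === 'Read' && input.file_path === undefined && typeof input.path === 'string') {
    extra.file_path = input.path;
  }
  // Prefix rule: take the value from the tool name unless one is already present.
  if (rule && input[rule.inputField] === undefined && extra[rule.inputField] === undefined) {
    extra[rule.inputField] = tc.tool.slice(rule.prefix.length).trim();
  }

  // Unknown or already-canonical calls pass through untouched.
  if (tool === tc.tool && Object.keys(extra).length === 0) return tc;
  return { ...tc, tool, input: { ...input, ...extra } };
}
```

With tables like these, supporting another provider would mostly mean adding rows. A call that is already canonical, such as a Claude Read with input.file_path set, comes back as the same object, which matches the expectation that normalization is a no-op for Claude.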
Call sites — apply normalization where each provider/parser constructs ToolCall objects:
| File | Where to normalize |
| --- | --- |
| providers/copilot-cli.ts | sessionUpdate handler where completedToolCalls.push(...) |
| providers/copilot-sdk.ts | Same pattern |
| providers/codex.ts | ToolCall construction (lines ~235-258) |
| providers/pi-cli.ts | extractMessages() |
| providers/copilot-log-parser.ts | Tool name from req.name ?? req.toolName |
| import/claude-parser.ts | tool_use block parsing (Claude is already canonical, but apply for consistency) |
| import/codex-parser.ts | function_call / custom_tool_call cases |
Claude providers (claude-cli.ts, claude-sdk.ts) already emit canonical names. Still wire them through normalizeToolCall for safety, but expect it to be a no-op.
Phase 2: Auto-write transcript JSONL, deprecate log_format
Always write a normalized transcript as a byproduct of every eval run. This is the TranscriptJsonLine format (already defined in packages/core/src/import/types.ts) — consolidated messages with canonical tool names, token usage, duration, and cost.
Deprecate log_format (hard, two-release):
| Release | Behavior |
| --- | --- |
| v4.15 (this change) | log_format still accepted but emits a CLI deprecation warning. New field stream_log introduced. Transcript JSONL always written alongside any stream log. |
| v4.16 (next release) | log_format removed. stream_log is the only option. |
New target config field: stream_log
| Value | Behavior |
| --- | --- |
| false (default) | No stream log. Only the transcript JSONL is written. |
| "summary" | Human-readable consolidated lines (current summary format) |
| "raw" | Per-event debug stream (current json format) |
Naming rationale:
- stream_log describes what it is: a log of the raw event stream. It's an opt-in debug tool.
- The transcript is not a log format option; it's the eval output. It's always written.
- "raw" replaces "json" because the distinction is content (raw events vs consolidated), not serialization (both are text).
Transcript output location: Write to the same log directory (<provider>/ under .agentv/logs/), with a .transcript.jsonl extension alongside any stream log. One transcript line per eval case invocation.
Files to change:
- providers/targets.ts: add stream_log field to target config types (lines ~457-552), add deprecation mapping from log_format → stream_log, update normalizeXxxLogFormat functions
- providers/targets-validator.ts: add stream_log to allowed fields, warn on log_format
- providers/copilot-utils.ts: StreamLoggerOptions.format accepts new values; CopilotStreamLogger writes transcript on close()
- providers/types.ts: update TargetConfig wire format type (line ~351)
- All provider files that create CopilotStreamLogger: pass the new option
- apps/cli/src/commands/eval/: surface deprecation warning in CLI output
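The v4.15 deprecation mapping from log_format to stream_log could be sketched roughly as follows. The type names, the resolveStreamLog helper, and the warning text are hypothetical. Letting an explicit stream_log take precedence over a legacy log_format is also an assumption, since the proposal does not state which value wins when both are set.

```typescript
// Hypothetical types for illustration; the real config types live in providers/targets.ts.
type LegacyLogFormat = 'json' | 'summary';
type StreamLog = false | 'summary' | 'raw';

interface TargetLogConfig {
  log_format?: LegacyLogFormat; // deprecated in v4.15, removed in v4.16
  stream_log?: StreamLog;
}

// Static mapping from the legacy values to their stream_log equivalents.
const LEGACY_TO_STREAM_LOG: Readonly<Record<LegacyLogFormat, StreamLog>> = {
  json: 'raw',
  summary: 'summary',
};

function resolveStreamLog(
  config: TargetLogConfig,
  warn: (message: string) => void,
): StreamLog {
  if (config.log_format !== undefined) {
    warn('log_format is deprecated and will be removed in v4.16; use stream_log instead.');
  }
  // Assumption: an explicit stream_log wins over a legacy log_format.
  if (config.stream_log !== undefined) return config.stream_log;
  if (config.log_format !== undefined) return LEGACY_TO_STREAM_LOG[config.log_format];
  return false; // default: no stream log, only the transcript is written
}
```

Passing the warning sink as a parameter keeps the helper pure enough to unit test, and lets the CLI decide how the deprecation notice is surfaced.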
Phase 3: Remove provider-specific matchers from skill-trigger
Once Phase 1 lands, delete from skill-trigger.ts:
- ToolMatcher interface (lines 26-41)
- CLAUDE_MATCHER, COPILOT_MATCHER, PI_CODING_AGENT_MATCHER, CODEX_MATCHER (lines 43-98)
- PROVIDER_TOOL_SEMANTICS mapping (lines 104-116)
- Provider-kind lookup in evaluate() (line 132)
Replace with a single canonical matcher:
    function findSkillTrigger(messages: readonly Message[], skillName: string): ToolCall | undefined {
      for (const msg of messages) {
        for (const tc of msg.toolCalls ?? []) {
          // Direct invocation: the canonical Skill tool carries the name in input.skill.
          if (tc.tool === 'Skill') {
            const skill = String((tc.input as Record<string, unknown>)?.skill ?? '');
            if (skill.includes(skillName)) return tc;
          }
          // Indirect invocation: the agent read a file under the skill's directory.
          if (tc.tool === 'Read') {
            const filePath = String((tc.input as Record<string, unknown>)?.file_path ?? '');
            if (filePath.includes(`skills/${skillName}/`)) return tc;
          }
        }
      }
      return undefined;
    }
skill-trigger stays as a built-in — "did the agent use the right skill?" is a universal eval primitive (design principle #2). It just no longer needs provider awareness.
Non-goals
- Changing the ToolCall TypeScript interface (only runtime values change, not the type shape)
- Normalizing tool output content (only tool name and input fields are normalized)
- Retroactively rewriting existing transcript/log files on disk (only newly-generated outputs change)
- Removing stream logging capability (it becomes opt-in, not removed)
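Acceptance criteria
- normalizeToolCall() exists with a static mapping table and unit tests covering each provider's native names
- Every provider and parser listed under Phase 1 calls normalizeToolCall() before returning ToolCall objects
- Every eval run writes a .transcript.jsonl file with consolidated TranscriptJsonLine entries
- log_format emits a deprecation warning pointing to stream_log
- stream_log: false (default) writes no stream log, only the transcript
- stream_log: raw produces the old per-event output; stream_log: summary produces consolidated lines
- skill-trigger evaluator has zero provider-specific code
- skill-trigger tests pass against canonical names only
- --transcript evals produce identical results whether the transcript was from Claude, Copilot, or Codex
- Existing evals that use type: skill-trigger continue working unchanged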
Risks and migration
Hard deprecation of log_format:
- v4.15: accepted with warning, mapped to stream_log equivalent (json → raw, summary → summary)
- v4.16: removed; targets.yaml with log_format will fail validation with a clear error message
Breaking change for code-graders that check provider-native tool names. A code-grader checking tc.tool === 'Read File' breaks because it's now tc.tool === 'Read'. Mitigate:
- Document the canonical tool name table in changelog and docs
- The --transcript wire format (JSONL on disk) also changes for newly-generated transcripts
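Because existing transcripts on disk are not rewritten, a code-grader may see both old and new names for a while. One possible transition pattern, sketched here with hypothetical names (LEGACY_READ_NAMES, readPathOf) and a minimal assumed ToolCall shape, is to accept the canonical name alongside the legacy variants from the table in Phase 1:

```typescript
// Hypothetical minimal shape; the real ToolCall type lives in AgentV.
interface ToolCall {
  tool: string;
  input?: unknown;
}

// Provider-native read names that older transcripts may still contain.
const LEGACY_READ_NAMES: ReadonlySet<string> = new Set([
  'Read File',
  'readFile',
  'readTextFile',
  'read',
]);

// Returns the path of a file read, or undefined if the call is not a read.
function readPathOf(tc: ToolCall): string | undefined {
  if (tc.tool !== 'Read' && !LEGACY_READ_NAMES.has(tc.tool)) return undefined;
  const input =
    typeof tc.input === 'object' && tc.input !== null
      ? (tc.input as Record<string, unknown>)
      : {};
  // Canonical field first, then the legacy `path` field some providers used.
  const path = input.file_path ?? input.path;
  return typeof path === 'string' ? path : undefined;
}
```

Once all transcripts a grader consumes are newly generated, the legacy set can be dropped and the check reduces to tc.tool === 'Read'.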
Not a breaking change for:
- skill-trigger eval YAML config (unchanged)
- llm-grader, code-grader, or any evaluator that doesn't inspect ToolCall.tool values
- Provider behavior or execution (normalization is cosmetic, applied after the provider runs)