refactor: normalize ToolCall names at provider layer, auto-write transcript, deprecate log_format #1059

@christso

Problem

Two related issues with how AgentV records and consumes provider output:

1. Provider-specific tool naming leaks into evaluators

The skill-trigger evaluator (packages/core/src/evaluation/evaluators/skill-trigger.ts) carries ~73 lines of provider-specific tool-name matching logic because each provider represents the same tool differently:

| Provider | Skill invocation | File read | Skill input field | Extra patterns |
| --- | --- | --- | --- | --- |
| Claude | tool: "Skill" | tool: "Read" | input.skill | |
| Copilot | tool: "Skill" or "skill" | tool: "Read File", "readFile", "Read", "readTextFile" | input.skill | "Using skill: X" encoded in tool name |
| Codex | | tool: "command_execution" | input.command | "mcp:<server>/<skill>" encoded in tool name |
| Pi | | tool: "read" | input.path | |

Every new provider requires a new matcher. The same problem will affect any future evaluator that inspects tool calls (tool-trajectory, cost-per-tool, etc.).

Root cause: Providers preserve native tool naming in ToolCall.tool. Normalization burden falls on every downstream consumer instead of being handled once at the provider boundary.

2. Raw event stream logs are the default output, not transcripts

Providers write raw per-event stream logs (one line per agent_message_chunk, tool_call, etc.) as the primary record of what happened. The normalized transcript format (TranscriptJsonLine), which consolidates messages, pairs tool calls, and includes token usage, is only available via a separate agentv import step.

Current naming is misleading: log_format: json sounds like "output as JSON" but actually means "raw event stream." log_format: summary sounds like a reduced version but is actually the more useful consolidated format. Neither produces a proper transcript.

Prior art: code-insights normalizes all provider sessions into a unified schema at the import/sync layer. Downstream consumers never see provider-specific naming or raw events.

Solution

Phase 1: Normalize ToolCall at the provider/import layer

Create a shared normalization function that maps provider-native tool names and input fields to canonical values. Apply it at the point where each provider constructs ToolCall objects.

Canonical tool names (use Claude's naming as the canonical set since it's already the cleanest):

| Canonical tool | Provider-native variants |
| --- | --- |
| "Skill" | "skill", "Using skill: X" (Copilot prefix), "mcp:<server>/<name>" (Codex) |
| "Read" | "Read File", "readFile", "readTextFile", "read", "Viewing X" (Copilot prefix) |
| "Write" | "writeTextFile", "Write File" |
| "Edit" | "editFile", "Edit File" |
| "Bash" | "command_execution" (Codex, when not a read), "runTerminalCommand" |

For tool names not in this table, pass through unchanged — normalization only covers the tools that evaluators need to reason about.

Canonical input fields — normalize alongside the tool name:

  • Skill: ensure input.skill contains the skill name (extract from tool name prefix if needed)
  • Read: ensure input.file_path contains the path (copy from input.path if that's what the provider uses)

Implementation:

Add normalizeToolCall(providerKind: ProviderKind, tc: ToolCall): ToolCall in a new file packages/core/src/evaluation/providers/normalize-tool-call.ts. This is a pure function — provider kind in, canonical ToolCall out. Use a static mapping table (not if/else chains) so adding a new provider means adding entries, not logic.
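A minimal sketch of what such a table-driven function might look like. The ToolCall shape, the provider keys ('claude', 'copilot', 'codex', 'pi'), the grouping of the Write/Edit/Bash aliases under Copilot, and the PrefixRule fields are all assumptions made for illustration; the real ProviderKind type and field names may differ. The Codex "when not a read" distinction for command_execution is deliberately left out.

```typescript
// Assumed minimal shapes; the real ToolCall and ProviderKind live in the AgentV codebase.
type ProviderKind = string;
interface ToolCall {
  tool: string;
  input?: unknown;
}

// A prefix rule moves a name encoded in the tool string into an input field.
interface PrefixRule {
  prefix: string;
  tool: string;
  inputField: string;
  separator?: string; // when set, keep only the text after its first occurrence
}

interface ProviderRules {
  aliases: ReadonlyMap<string, string>;
  prefixes: readonly PrefixRule[];
}

// Static table keyed by provider kind (keys and grouping are hypothetical).
const RULES: ReadonlyMap<ProviderKind, ProviderRules> = new Map<ProviderKind, ProviderRules>([
  ['claude', { aliases: new Map(), prefixes: [] }],
  ['copilot', {
    aliases: new Map([
      ['skill', 'Skill'], ['Read File', 'Read'], ['readFile', 'Read'], ['readTextFile', 'Read'],
      ['writeTextFile', 'Write'], ['Write File', 'Write'],
      ['editFile', 'Edit'], ['Edit File', 'Edit'], ['runTerminalCommand', 'Bash'],
    ]),
    prefixes: [
      { prefix: 'Using skill: ', tool: 'Skill', inputField: 'skill' },
      { prefix: 'Viewing ', tool: 'Read', inputField: 'file_path' },
    ],
  }],
  ['codex', {
    aliases: new Map([['command_execution', 'Bash']]),
    prefixes: [{ prefix: 'mcp:', tool: 'Skill', inputField: 'skill', separator: '/' }],
  }],
  ['pi', { aliases: new Map([['read', 'Read']]), prefixes: [] }],
]);

function normalizeToolCall(providerKind: ProviderKind, tc: ToolCall): ToolCall {
  const rules = RULES.get(providerKind);
  if (!rules) return tc; // unknown provider: pass through unchanged

  // Copy the input so the caller's object is never mutated.
  const input: Record<string, unknown> =
    typeof tc.input === 'object' && tc.input !== null
      ? { ...(tc.input as Record<string, unknown>) }
      : {};
  let tool = rules.aliases.get(tc.tool) ?? tc.tool; // unmapped names pass through

  const rule = rules.prefixes.find((r) => tc.tool.startsWith(r.prefix));
  if (rule) {
    let rest = tc.tool.slice(rule.prefix.length);
    if (rule.separator !== undefined) {
      const cut = rest.indexOf(rule.separator);
      if (cut >= 0) rest = rest.slice(cut + rule.separator.length);
    }
    tool = rule.tool;
    if (input[rule.inputField] === undefined) input[rule.inputField] = rest;
  }

  // Read calls expose the path as input.file_path, copied from input.path when needed.
  if (tool === 'Read' && input.file_path === undefined && input.path !== undefined) {
    input.file_path = input.path;
  }
  return { ...tc, tool, input: Object.keys(input).length > 0 ? input : tc.input };
}
```

Under these assumptions, normalizeToolCall('pi', { tool: 'read', input: { path: 'skills/x/SKILL.md' } }) yields a call with tool "Read" and input.file_path set, while a Claude call passes through with the same values. Using Map lookups instead of plain object indexing avoids accidental matches on inherited keys such as "constructor".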

Call sites — apply normalization where each provider/parser constructs ToolCall objects:

| File | Where to normalize |
| --- | --- |
| providers/copilot-cli.ts | sessionUpdate handler where completedToolCalls.push(...) |
| providers/copilot-sdk.ts | Same pattern |
| providers/codex.ts | ToolCall construction (lines ~235-258) |
| providers/pi-cli.ts | extractMessages() |
| providers/copilot-log-parser.ts | Tool name from req.name ?? req.toolName |
| import/claude-parser.ts | tool_use block parsing (Claude is already canonical, but apply for consistency) |
| import/codex-parser.ts | function_call / custom_tool_call cases |

Claude providers (claude-cli.ts, claude-sdk.ts) already emit canonical names. Still wire them through normalizeToolCall for safety, but expect it to be a no-op.

Phase 2: Auto-write transcript JSONL, deprecate log_format

Always write a normalized transcript as a byproduct of every eval run. This is the TranscriptJsonLine format (already defined in packages/core/src/import/types.ts) — consolidated messages with canonical tool names, token usage, duration, and cost.

Deprecate log_format (hard, two-release):

| Release | Behavior |
| --- | --- |
| v4.15 (this change) | log_format still accepted but emits a CLI deprecation warning. New field stream_log introduced. Transcript JSONL always written alongside any stream log. |
| v4.16 (next release) | log_format removed. stream_log is the only option. |

New target config field: stream_log

| Value | Behavior |
| --- | --- |
| false (default) | No stream log. Only the transcript JSONL is written. |
| "summary" | Human-readable consolidated lines (current summary format) |
| "raw" | Per-event debug stream (current json format) |

Naming rationale:

  • stream_log describes what it is — a log of the raw event stream. It's an opt-in debug tool.
  • The transcript is not a log format option; it's the eval output. It's always written.
  • "raw" replaces "json" because the distinction is content (raw events vs consolidated), not serialization (both are text).
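The v4.15 behavior could be resolved roughly as follows. This is a sketch under assumptions: the TargetLogConfig and ResolvedStreamLog types and the resolveStreamLog name are hypothetical, and letting an explicit stream_log win over a legacy log_format is a guess at precedence that the issue does not specify.

```typescript
type StreamLog = false | 'summary' | 'raw';

// Hypothetical slice of the target config that concerns logging.
interface TargetLogConfig {
  stream_log?: StreamLog;
  log_format?: 'json' | 'summary'; // deprecated in v4.15
}

interface ResolvedStreamLog {
  streamLog: StreamLog;
  warnings: string[];
}

// Legacy values and their stream_log equivalents.
const LEGACY_TO_STREAM_LOG = { json: 'raw', summary: 'summary' } as const;

function resolveStreamLog(config: TargetLogConfig): ResolvedStreamLog {
  const warnings: string[] = [];
  if (config.log_format !== undefined) {
    const replacement = LEGACY_TO_STREAM_LOG[config.log_format];
    warnings.push(
      `log_format is deprecated and will be removed in v4.16; use stream_log: ${replacement} instead.`,
    );
  }
  // Assumption: an explicit stream_log takes precedence over the legacy field.
  if (config.stream_log !== undefined) {
    return { streamLog: config.stream_log, warnings };
  }
  if (config.log_format !== undefined) {
    return { streamLog: LEGACY_TO_STREAM_LOG[config.log_format], warnings };
  }
  return { streamLog: false, warnings }; // default: transcript only, no stream log
}

console.log(resolveStreamLog({}).streamLog); // false
console.log(resolveStreamLog({ log_format: 'json' }).streamLog); // raw
```

Returning the warnings instead of printing them keeps the function pure, so the CLI layer can decide how to surface the deprecation message.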

Transcript output location: Write to the same log directory (<provider>/ under .agentv/logs/), with a .transcript.jsonl extension alongside any stream log. One transcript line per eval case invocation.
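One way the transcript location and append step might be wired up is sketched below. The helper names, the caseId-based file name, and the generic Record<string, unknown> entry type (standing in for TranscriptJsonLine) are assumptions; only the .transcript.jsonl extension and the <provider>/ subdirectory come from the description above.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical naming: everything before ".transcript.jsonl" is an assumption.
function transcriptPath(logsRoot: string, provider: string, caseId: string): string {
  return path.join(logsRoot, provider, `${caseId}.transcript.jsonl`);
}

// JSONL: exactly one JSON object per line, terminated by a newline.
function toJsonLine(entry: Record<string, unknown>): string {
  return `${JSON.stringify(entry)}\n`;
}

// Appends one transcript line, creating the provider directory if needed.
function appendTranscriptLine(file: string, entry: Record<string, unknown>): void {
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.appendFileSync(file, toJsonLine(entry), 'utf8');
}
```

A caller would compute transcriptPath('.agentv/logs', 'codex', caseId) once per eval case invocation and call appendTranscriptLine with the consolidated entry.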

Files to change:

  • providers/targets.ts: add stream_log field to target config types (lines ~457-552), add deprecation mapping from log_format → stream_log, update normalizeXxxLogFormat functions
  • providers/targets-validator.ts: add stream_log to allowed fields, warn on log_format
  • providers/copilot-utils.ts: StreamLoggerOptions.format accepts new values; CopilotStreamLogger writes transcript on close()
  • providers/types.ts: update TargetConfig wire format type (line ~351)
  • All provider files that create CopilotStreamLogger: pass the new option
  • apps/cli/src/commands/eval/: surface deprecation warning in CLI output

Phase 3: Remove provider-specific matchers from skill-trigger

Once Phase 1 lands, delete from skill-trigger.ts:

  • ToolMatcher interface (lines 26-41)
  • CLAUDE_MATCHER, COPILOT_MATCHER, PI_CODING_AGENT_MATCHER, CODEX_MATCHER (lines 43-98)
  • PROVIDER_TOOL_SEMANTICS mapping (lines 104-116)
  • Provider-kind lookup in evaluate() (line 132)

Replace with a single canonical matcher:

```ts
// Message and ToolCall are the existing AgentV types; tool names are canonical after Phase 1.
function findSkillTrigger(messages: readonly Message[], skillName: string): ToolCall | undefined {
  for (const msg of messages) {
    for (const tc of msg.toolCalls ?? []) {
      // Direct invocation through the Skill tool.
      if (tc.tool === 'Skill') {
        const skill = String((tc.input as Record<string, unknown>)?.skill ?? '');
        if (skill.includes(skillName)) return tc;
      }
      // Indirect trigger: the agent read a file under the skill's directory.
      if (tc.tool === 'Read') {
        const filePath = String((tc.input as Record<string, unknown>)?.file_path ?? '');
        if (filePath.includes(`skills/${skillName}/`)) return tc;
      }
    }
  }
  return undefined;
}
```
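To illustrate how the canonical matcher above behaves, here is a self-contained sketch that repeats the function with minimal stand-in types and runs it against a few hand-written messages. The Message and ToolCall shapes and the sample data are assumptions for illustration, not real provider output.

```typescript
// Minimal stand-ins for AgentV's Message and ToolCall types (assumed shapes).
interface ToolCall {
  tool: string;
  input?: unknown;
}
interface Message {
  toolCalls?: readonly ToolCall[];
}

function findSkillTrigger(messages: readonly Message[], skillName: string): ToolCall | undefined {
  for (const msg of messages) {
    for (const tc of msg.toolCalls ?? []) {
      if (tc.tool === 'Skill') {
        const skill = String((tc.input as Record<string, unknown>)?.skill ?? '');
        if (skill.includes(skillName)) return tc;
      }
      if (tc.tool === 'Read') {
        const filePath = String((tc.input as Record<string, unknown>)?.file_path ?? '');
        if (filePath.includes(`skills/${skillName}/`)) return tc;
      }
    }
  }
  return undefined;
}

// Hypothetical normalized transcripts: every provider would yield these shapes.
const viaSkillTool: Message[] = [
  { toolCalls: [{ tool: 'Bash', input: { command: 'ls' } }] },
  { toolCalls: [{ tool: 'Skill', input: { skill: 'pdf-tools' } }] },
];
const viaFileRead: Message[] = [
  { toolCalls: [{ tool: 'Read', input: { file_path: '/repo/skills/pdf-tools/SKILL.md' } }] },
];
const noTrigger: Message[] = [
  { toolCalls: [{ tool: 'Read', input: { file_path: 'README.md' } }] },
  {}, // a message without tool calls
];

console.log(findSkillTrigger(viaSkillTool, 'pdf-tools')?.tool); // Skill
console.log(findSkillTrigger(viaFileRead, 'pdf-tools')?.tool); // Read
console.log(findSkillTrigger(noTrigger, 'pdf-tools')); // undefined
```

Note that the Skill branch is a substring match, so a skill named pdf would also match an input.skill of pdf-tools; whether that looseness is desirable is a design choice worth covering in the evaluator's tests.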

skill-trigger stays as a built-in — "did the agent use the right skill?" is a universal eval primitive (design principle #2). It just no longer needs provider awareness.

Non-goals

  • Changing the ToolCall TypeScript interface — only runtime values change, not the type shape
  • Normalizing tool output content — only tool name and input fields
  • Retroactively rewriting existing transcript/log files on disk — only newly-generated outputs change
  • Removing stream logging capability — it becomes opt-in, not removed

Acceptance criteria

  • normalizeToolCall() exists with a static mapping table and unit tests covering each provider's native names
  • All providers and import parsers call normalizeToolCall() before returning ToolCall objects
  • Every eval run writes a .transcript.jsonl file with consolidated TranscriptJsonLine entries
  • log_format emits a deprecation warning pointing to stream_log
  • stream_log: false (default) writes no stream log — only the transcript
  • stream_log: raw produces the old per-event output; stream_log: summary produces consolidated lines
  • skill-trigger evaluator has zero provider-specific code
  • skill-trigger tests pass against canonical names only
  • --transcript evals produce identical results whether the transcript came from Claude, Copilot, or Codex
  • Existing eval YAML files using type: skill-trigger continue working unchanged

Risks and migration

Hard deprecation of log_format:

  • v4.15: accepted with warning, mapped to stream_log equivalent (json → raw, summary → summary)
  • v4.16: removed — targets.yaml with log_format will fail validation with a clear error message

Breaking change for code-graders that check provider-native tool names. A code-grader checking tc.tool === 'Read File' stops matching because the value is now 'Read'. Mitigation and related impact:

  • Document the canonical tool name table in changelog and docs
  • The --transcript wire format (JSONL on disk) also changes for newly-generated transcripts
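For code-graders that must handle both newly generated transcripts and older files still on disk (which are not rewritten), a transitional check along these lines may help. The helper names are hypothetical, and the legacy list is simply the "Read" row of the canonical table.

```typescript
// Assumed minimal shape of a tool call as seen by a code-grader.
interface ToolCall {
  tool: string;
  input?: unknown;
}

// Provider-native names that older transcripts may still contain.
const LEGACY_READ_NAMES: ReadonlySet<string> = new Set([
  'Read File',
  'readFile',
  'readTextFile',
  'read',
]);

// Accepts the canonical name first, then falls back to legacy names.
function isReadCall(tc: ToolCall): boolean {
  return tc.tool === 'Read' || LEGACY_READ_NAMES.has(tc.tool);
}

function countReads(toolCalls: readonly ToolCall[]): number {
  return toolCalls.filter(isReadCall).length;
}

console.log(countReads([{ tool: 'Read' }, { tool: 'Read File' }, { tool: 'Bash' }])); // 2
```

Once all graded transcripts are known to be post-normalization, the legacy set can be dropped and the check reduces to tc.tool === 'Read'.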

Not a breaking change for:

  • skill-trigger eval YAML config (unchanged)
  • llm-grader, code-grader, or any evaluator that doesn't inspect ToolCall.tool values
  • Provider behavior or execution — normalization is cosmetic, applied after the provider runs
