refactor: normalize ToolCall names at provider layer, auto-write transcript, deprecate log_format (#1059)

## Problem

Two related issues with how AgentV records and consumes provider output:

### 1. Provider-specific tool naming leaks into evaluators

The `skill-trigger` evaluator (`packages/core/src/evaluation/evaluators/skill-trigger.ts`) carries ~73 lines of provider-specific tool-name matching logic because each provider represents the same tool differently:

| Provider | Skill invocation | File read | Skill input field | Extra patterns |
|----------|-----------------|-----------|-------------------|----------------|
| Claude | `tool: "Skill"` | `tool: "Read"` | `input.skill` | — |
| Copilot | `tool: "Skill"` or `"skill"` | `tool: "Read File"`, `"readFile"`, `"Read"`, `"readTextFile"` | `input.skill` | `"Using skill: X"` encoded in tool name |
| Codex | — | `tool: "command_execution"` | `input.command` | `"mcp:<server>/<skill>"` encoded in tool name |
| Pi | — | `tool: "read"` | `input.path` | — |

Every new provider requires a new matcher. The same problem will affect any future evaluator that inspects tool calls (tool-trajectory, cost-per-tool, etc.).

**Root cause:** Providers preserve native tool naming in `ToolCall.tool`. The normalization burden falls on every downstream consumer instead of being handled once at the provider boundary.

### 2. Raw event stream logs are the default output, not transcripts

Providers write raw per-event stream logs (one line per `agent_message_chunk`, `tool_call`, etc.) as the primary record of what happened. The normalized transcript format (`TranscriptJsonLine`) — which consolidates messages, pairs tool calls, and includes token usage — is only available via a separate `agentv import` step. This means:

- Users see fragmented chunk events in log files (the bug that motivated PR #1047)
- Offline replay (`--transcript`) requires a manual import step after every eval
- The useful artifact (consolidated transcript) is harder to get than the debug artifact (raw events)

**Current naming is misleading:** `log_format: json` sounds like "output as JSON" but actually means "raw event stream." `log_format: summary` sounds like a reduced version but is actually the more useful consolidated format. Neither produces a proper transcript.

**Prior art:** code-insights normalizes all provider sessions into a unified schema at the import/sync layer. Downstream consumers never see provider-specific naming or raw events.

## Solution

### Phase 1: Normalize ToolCall at the provider/import layer

Create a shared normalization function that maps provider-native tool names and input fields to canonical values. Apply it at the point where each provider constructs `ToolCall` objects.

**Canonical tool names** (use Claude's naming as the canonical set since it's already the cleanest):

| Canonical `tool` | Provider-native variants |
|-------------------|------------------------|
| `"Skill"` | `"skill"`, `"Using skill: X"` (Copilot prefix), `"mcp:<server>/<name>"` (Codex) |
| `"Read"` | `"Read File"`, `"readFile"`, `"readTextFile"`, `"read"`, `"Viewing X"` (Copilot prefix), `"command_execution"` (Codex, when the command is a file read) |
| `"Write"` | `"writeTextFile"`, `"Write File"` |
| `"Edit"` | `"editFile"`, `"Edit File"` |
| `"Bash"` | `"command_execution"` (Codex, when not a read), `"runTerminalCommand"` |

For tool names not in this table, **pass through unchanged** — normalization only covers the tools that evaluators need to reason about.

**Canonical input fields** — normalize alongside the tool name:
- Skill: ensure `input.skill` contains the skill name (extract from tool name prefix if needed)
- Read: ensure `input.file_path` contains the path (copy from `input.path` if that's what the provider uses)
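
The two input-field rules above could be sketched roughly as follows. This is only an illustration: `normalizeInputFields` is a hypothetical helper name, and the assumption that the Codex skill name is whatever follows the first `/` in `mcp:<server>/<skill>` is mine, not something the current code confirms.

```typescript
// Hypothetical helper: fills in canonical input fields for a tool call whose
// name has already been mapped to its canonical value.
function normalizeInputFields(
  canonicalTool: string,
  nativeTool: string,
  input: unknown,
): Record<string, unknown> {
  // Copy the input so the caller's object is never mutated.
  const fields: Record<string, unknown> =
    typeof input === 'object' && input !== null ? { ...(input as Record<string, unknown>) } : {};

  if (canonicalTool === 'Skill' && typeof fields.skill !== 'string') {
    const copilotPrefix = 'Using skill: ';
    const slash = nativeTool.indexOf('/');
    if (nativeTool.startsWith(copilotPrefix)) {
      // Copilot encodes the skill in the tool name: "Using skill: X".
      fields.skill = nativeTool.slice(copilotPrefix.length).trim();
    } else if (nativeTool.startsWith('mcp:') && slash !== -1) {
      // Assumption: for Codex, the skill is the part after "mcp:<server>/".
      fields.skill = nativeTool.slice(slash + 1);
    }
  }

  if (canonicalTool === 'Read' && typeof fields.file_path !== 'string' && typeof fields.path === 'string') {
    // Providers such as Pi use input.path; copy it to the canonical field.
    fields.file_path = fields.path;
  }

  return fields;
}
```

Existing canonical fields are left untouched, so Claude tool calls pass through this helper unchanged.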

**Implementation:**

Add `normalizeToolCall(providerKind: ProviderKind, tc: ToolCall): ToolCall` in a new file `packages/core/src/evaluation/providers/normalize-tool-call.ts`. This is a pure function: it takes the provider kind and a native ToolCall and returns the canonical ToolCall. Use a static mapping table (not if/else chains) so adding a new provider means adding entries, not logic.
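
A minimal sketch of what a table-driven `normalizeToolCall` might look like. The `ProviderKind` and `ToolCall` shapes below are local stand-ins, the provider keys (`'copilot'`, `'codex'`, `'pi'`) are assumed values, and the `Rule` type is hypothetical. The Codex `command_execution` split between `Read` and `Bash` needs to inspect `input.command`, so it is deliberately left out here and passes through unchanged.

```typescript
// Local stand-ins; the real types already exist in AgentV and may differ.
type ProviderKind = string;
interface ToolCall {
  readonly tool: string;
  readonly input?: unknown;
}

// Hypothetical rule shape: exact names and name prefixes that map to one canonical tool.
interface Rule {
  readonly canonical: string;
  readonly exact: readonly string[];
  readonly prefixes: readonly string[];
}

// Static mapping table. Adding a provider means adding an entry, not logic.
const RULES = new Map<ProviderKind, readonly Rule[]>([
  ['copilot', [
    { canonical: 'Skill', exact: ['skill'], prefixes: ['Using skill: '] },
    { canonical: 'Read', exact: ['Read File', 'readFile', 'readTextFile'], prefixes: ['Viewing '] },
    { canonical: 'Write', exact: ['writeTextFile', 'Write File'], prefixes: [] },
    { canonical: 'Edit', exact: ['editFile', 'Edit File'], prefixes: [] },
    { canonical: 'Bash', exact: ['runTerminalCommand'], prefixes: [] },
  ]],
  ['codex', [{ canonical: 'Skill', exact: [], prefixes: ['mcp:'] }]],
  ['pi', [{ canonical: 'Read', exact: ['read'], prefixes: [] }]],
]);

function normalizeToolCall(providerKind: ProviderKind, tc: ToolCall): ToolCall {
  for (const rule of RULES.get(providerKind) ?? []) {
    const matches =
      rule.exact.includes(tc.tool) || rule.prefixes.some((prefix) => tc.tool.startsWith(prefix));
    // Return a copy with the canonical name; the input is carried over untouched.
    if (matches) return { ...tc, tool: rule.canonical };
  }
  // Tools outside the table (and providers without an entry) pass through unchanged.
  return tc;
}
```

Keying the table by provider keeps a native name from one provider from being misread as another provider's tool, at the cost of repeating a few entries.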

**Call sites** — apply normalization where each provider/parser constructs ToolCall objects:

| File | Where to normalize |
|------|--------------------|
| `providers/copilot-cli.ts` | `sessionUpdate` handler where `completedToolCalls.push(...)` |
| `providers/copilot-sdk.ts` | Same pattern |
| `providers/codex.ts` | ToolCall construction (lines ~235-258) |
| `providers/pi-cli.ts` | `extractMessages()` |
| `providers/copilot-log-parser.ts` | Tool name from `req.name ?? req.toolName` |
| `import/claude-parser.ts` | `tool_use` block parsing — Claude is already canonical, but apply for consistency |
| `import/codex-parser.ts` | `function_call` / `custom_tool_call` cases |

Claude providers (`claude-cli.ts`, `claude-sdk.ts`) already emit canonical names. Still wire them through `normalizeToolCall` for safety, but expect it to be a no-op.

### Phase 2: Auto-write transcript JSONL, deprecate `log_format`

**Always write a normalized transcript** as a byproduct of every eval run. This is the `TranscriptJsonLine` format (already defined in `packages/core/src/import/types.ts`) — consolidated messages with canonical tool names, token usage, duration, and cost.

**Deprecate `log_format` (hard, two-release):**

| Release | Behavior |
|---------|----------|
| **v4.15** (this change) | `log_format` still accepted but emits a CLI deprecation warning. New field `stream_log` introduced. Transcript JSONL always written alongside any stream log. |
| **v4.16** (next release) | `log_format` removed. `stream_log` is the only option. |

**New target config field: `stream_log`**

| Value | Behavior |
|-------|----------|
| `false` (default) | No stream log. Only the transcript JSONL is written. |
| `"summary"` | Human-readable consolidated lines (current summary format) |
| `"raw"` | Per-event debug stream (current json format) |

**Naming rationale:**
- `stream_log` describes what it is: a log of the provider's event stream, either per-event (`"raw"`) or consolidated (`"summary"`). It's an opt-in debug tool.
- The transcript is not a log format option; it's the eval output. It's always written.
- `"raw"` replaces `"json"` because the distinction is content (raw events vs consolidated), not serialization (both are text).

**Transcript output location:** Write to the same log directory (`<provider>/` under `.agentv/logs/`), with a `.transcript.jsonl` extension alongside any stream log. One transcript line per eval case invocation.

**Files to change:**
- `providers/targets.ts` — add `stream_log` field to target config types (lines ~457-552), add deprecation mapping from `log_format` → `stream_log`, update `normalizeXxxLogFormat` functions
- `providers/targets-validator.ts` — add `stream_log` to allowed fields, warn on `log_format`
- `providers/copilot-utils.ts` — `StreamLoggerOptions.format` accepts new values; `CopilotStreamLogger` writes transcript on `close()`
- `providers/types.ts` — update `TargetConfig` wire format type (line ~351)
- All provider files that create `CopilotStreamLogger` — pass the new option
- `apps/cli/src/commands/eval/` — surface deprecation warning in CLI output
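
The deprecation mapping for v4.15 might be sketched like this. The type and function names (`RawTargetConfig`, `resolveStreamLog`) are hypothetical, and the rule that an explicit `stream_log` wins when both fields are present is an assumption this proposal does not spell out.

```typescript
type StreamLog = false | 'summary' | 'raw';
type LegacyLogFormat = 'json' | 'summary';

// Hypothetical slice of a target config; only the two fields relevant here.
interface RawTargetConfig {
  stream_log?: StreamLog;
  log_format?: LegacyLogFormat;
}

interface ResolvedStreamLog {
  streamLog: StreamLog;
  warnings: string[];
}

function resolveStreamLog(config: RawTargetConfig): ResolvedStreamLog {
  const warnings: string[] = [];
  if (config.log_format !== undefined) {
    // v4.15 behavior: still accepted, but the CLI should surface this warning.
    warnings.push('`log_format` is deprecated; use `stream_log` instead.');
  }
  if (config.stream_log !== undefined) {
    // Assumption: the new field takes precedence when both are set.
    return { streamLog: config.stream_log, warnings };
  }
  if (config.log_format !== undefined) {
    // Legacy mapping: json -> raw, summary -> summary.
    return { streamLog: config.log_format === 'json' ? 'raw' : 'summary', warnings };
  }
  // New default: no stream log, only the transcript JSONL is written.
  return { streamLog: false, warnings };
}
```

Returning the warnings instead of printing them keeps the function pure, so the CLI layer can decide how to display them.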

### Phase 3: Remove provider-specific matchers from skill-trigger

Once Phase 1 lands, delete from `skill-trigger.ts`:
- `ToolMatcher` interface (lines 26-41)
- `CLAUDE_MATCHER`, `COPILOT_MATCHER`, `PI_CODING_AGENT_MATCHER`, `CODEX_MATCHER` (lines 43-98)
- `PROVIDER_TOOL_SEMANTICS` mapping (lines 104-116)
- Provider-kind lookup in `evaluate()` (line 132)

Replace with a single canonical matcher:

```typescript
// After Phase 1 every provider emits canonical names, so one matcher covers all of them.
function findSkillTrigger(messages: readonly Message[], skillName: string): ToolCall | undefined {
  for (const msg of messages) {
    for (const tc of msg.toolCalls ?? []) {
      const input = (tc.input ?? {}) as Record<string, unknown>;
      // Direct invocation: the skill name is carried in input.skill.
      if (tc.tool === 'Skill') {
        const skill = String(input.skill ?? '');
        if (skill.includes(skillName)) return tc;
      }
      // Indirect trigger: the agent read a file under the skill's directory.
      if (tc.tool === 'Read') {
        const filePath = String(input.file_path ?? '');
        if (filePath.includes(`skills/${skillName}/`)) return tc;
      }
    }
  }
  return undefined;
}
```

skill-trigger **stays as a built-in** — "did the agent use the right skill?" is a universal eval primitive (design principle #2). It just no longer needs provider awareness.

## Non-goals

- Changing the `ToolCall` TypeScript interface — only runtime values change, not the type shape
- Normalizing tool `output` content — only `tool` name and input fields
- Retroactively rewriting existing transcript/log files on disk — only newly-generated outputs change
- Removing stream logging capability — it becomes opt-in, not removed

## Acceptance criteria

- [ ] `normalizeToolCall()` exists with a static mapping table and unit tests covering each provider's native names
- [ ] All providers and import parsers call `normalizeToolCall()` before returning ToolCall objects
- [ ] Every eval run writes a `.transcript.jsonl` file with consolidated `TranscriptJsonLine` entries
- [ ] `log_format` emits a deprecation warning pointing to `stream_log`
- [ ] `stream_log: false` (default) writes no stream log — only the transcript
- [ ] `stream_log: raw` produces the old per-event output; `stream_log: summary` produces consolidated lines
- [ ] `skill-trigger` evaluator has zero provider-specific code
- [ ] `skill-trigger` tests pass against canonical names only
- [ ] `--transcript` evals produce identical results whether the transcript came from Claude, Copilot, or Codex
- [ ] Existing eval YAML files using `type: skill-trigger` continue working unchanged

## Risks and migration

**Hard deprecation of `log_format`:**
- v4.15: accepted with warning, mapped to `stream_log` equivalent (`json` → `raw`, `summary` → `summary`)
- v4.16: removed — targets.yaml with `log_format` will fail validation with a clear error message

**Breaking change for code-graders that check provider-native tool names.** A code-grader checking `tc.tool === 'Read File'` breaks because the value is now `'Read'`. The `--transcript` wire format (JSONL on disk) changes in the same way for newly generated transcripts. Mitigation:
- Document the canonical tool name table in the changelog and docs

**Not a breaking change for:**
- `skill-trigger` eval YAML config (unchanged)
- `llm-grader`, and any `code-grader` or other evaluator that doesn't inspect `ToolCall.tool` values
- Provider behavior or execution — normalization is cosmetic, applied after the provider runs
