MiniMax M3 agentic coding probe: 19/21 oracle pass vs Codex 21/21 (one genuine wrong-but-confident miss + one oracle-coverage artifact)

> Small, self-critical agentic-coding probe of MiniMax M3 vs Codex (baseline), shared as feedback. Reproducible harness; failures fact-checked against run journals; one of the two non-passing runs is flagged as an eval-side artifact rather than a model issue.

## Context

I ran a small local agentic coding evaluation for MiniMax M3, using Codex as a baseline. The goal was not a broad benchmark, but to probe realistic coding-agent behavior: autonomous edits, test execution, cross-file coherence, read-only bug review, long repository tracing, token retention, and strict file-boundary following.

## Evaluation setup

- Matrix: 7 scenarios × 2 models × 3 repeats = 42 valid runs.
- Each run used the same sandbox reset point, the same prompt, and the same **model-agnostic oracle** (a per-scenario pass/fail check, independent of what the model claims).
- M3 ran through a local agent harness (Hermes WebUI) that records run-journal telemetry (tool calls, wall time, output tokens, compression events). Codex ran through `codex exec` against the same sandbox and the same oracle, but without that telemetry. The fair head-to-head axis is therefore **oracle pass/fail**; telemetry is used only to analyze M3's behavior.
- Invalid runs (interrupted, stale journal, workspace mismatch, manually repaired) were excluded.
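For reference, the run matrix can be enumerated in a few lines. This is a minimal sketch, not the harness code: the `codex` model label is an assumption (only `m3` appears in the run IDs quoted in this report), and the function name is made up for illustration.

```python
from itertools import product

SCENARIOS = [f"S{i}" for i in range(1, 8)]  # S1..S7, as in the results table
MODELS = ["m3", "codex"]                     # "codex" label is assumed
REPEATS = [f"r{i}" for i in range(1, 4)]     # r1..r3


def run_matrix():
    """Return every (scenario, model, repeat) cell of the evaluation matrix."""
    return list(product(SCENARIOS, MODELS, REPEATS))


runs = run_matrix()
print(len(runs))          # 42
print(" ".join(runs[0]))  # S1 m3 r1
```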

## Result summary

- **M3: 19/21 oracle pass. Codex: 21/21.**
- Two M3 runs did not pass the oracle — `S4 m3 r1` and `S5 m3 r3` — both *completed* with a final answer, no tool errors, no iteration-limit hits (mechanically tagged `wrong-but-confident`).
- On review, **only `S4 m3 r1` is a genuine model-quality miss** (and M3 passed S4 on the other two of its three runs). **`S5 m3 r3` is mainly an oracle-coverage artifact**: M3 traced a real, valid alternative code path, but the oracle accepted only one path. Treat S5 as low-confidence.
- `compress_events = 0` across the entire run set. M3 passed S6 (long-context token recall), but this run set does **not** prove post-compression retention, because auto-compression never triggered.
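The mechanical `wrong-but-confident` tag can be expressed as a small rule over the journal columns. The sketch below is an assumption about how such a rule might look, not the harness's actual logic: it reads `answer_after_tools = 1` as "a final answer was produced", and the `other-failure` label is hypothetical. The CSV header and row are the ones quoted in Repro 1.

```python
import csv
import io

HEADER = (
    "scenario,model,run,oracle_pass,terminal_state,wall_s,tool_calls,"
    "tool_completes,tool_errors,compress_events,compressed_events,"
    "answer_after_tools,tokens_out,iter_limit_hit,n_events,failure_tag"
)
S4_ROW = "S4,m3,r1,0,completed,67.9,6,6,0,0,0,1,2252,0,76,wrong-but-confident"


def failure_tag(row):
    """Tag a run from its journal fields (hypothetical rule, see lead-in)."""
    if int(row["oracle_pass"]) == 1:
        return ""  # passing runs carry no failure tag
    clean_finish = (
        row["terminal_state"] == "completed"
        and int(row["answer_after_tools"]) == 1  # assumed: final answer given
        and int(row["tool_errors"]) == 0
        and int(row["iter_limit_hit"]) == 0
    )
    return "wrong-but-confident" if clean_finish else "other-failure"


row = next(csv.DictReader(io.StringIO(HEADER + "\n" + S4_ROW)))
print(failure_tag(row))  # wrong-but-confident
```

Keeping the rule purely mechanical is deliberate: it separates "the run finished cleanly but the oracle disagreed" from tool or runtime failures before any human review.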

### Per-scenario oracle pass (M3 vs Codex)

| Scenario | Dimension | M3 | Codex |
|---|---|---:|---:|
| S1 | single-file implement | 3/3 | 3/3 |
| S2 | cross-file feature | 3/3 | 3/3 |
| S3 | read-only bug review | 3/3 | 3/3 |
| S4 | test quality (mutation) | 2/3 | 3/3 |
| S5 | long repo trace | 2/3* | 3/3 |
| S6 | long-context token recall | 3/3 | 3/3 |
| S7 | strict file boundary | 3/3 | 3/3 |

\* S5's single failure is largely an oracle-coverage artifact (see Repro 2), not a confirmed model weakness.
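As a quick consistency check, the per-scenario counts in the table sum to the headline figures. The values are copied from the table; the helper name is only illustrative.

```python
# Per-scenario pass counts from the table: (M3, Codex), each out of 3 runs.
PASSES = {
    "S1": (3, 3), "S2": (3, 3), "S3": (3, 3), "S4": (2, 3),
    "S5": (2, 3), "S6": (3, 3), "S7": (3, 3),
}
REPEATS = 3


def totals(passes):
    """Return (M3 passes, Codex passes, runs per model)."""
    m3 = sum(pair[0] for pair in passes.values())
    codex = sum(pair[1] for pair in passes.values())
    return m3, codex, REPEATS * len(passes)


m3, codex, n = totals(PASSES)
print(f"M3 {m3}/{n}, Codex {codex}/{n}")  # M3 19/21, Codex 21/21
```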

M3 telemetry (21 runs): avg wall 81.5s, avg tool calls 10.4, avg output tokens ~2656, tool errors 0, iteration-limit hits 0, compression events 0.

## Repro 1 — `S4 m3 r1`: test quality / mutation oracle (genuine miss)

S4 asks the model to write pytest tests for a `classify(n)` function. The oracle requires the generated tests to (a) pass on the real implementation and (b) **fail on three planted mutants** (mutation testing).

M3 produced a broadly thorough test suite (branch and boundary cases, even the `bool`-is-`int` edge case) but did not kill all three mutants, so the oracle failed.

```csv
scenario,model,run,oracle_pass,terminal_state,wall_s,tool_calls,tool_completes,tool_errors,compress_events,compressed_events,answer_after_tools,tokens_out,iter_limit_hit,n_events,failure_tag
S4,m3,r1,0,completed,67.9,6,6,0,0,0,1,2252,0,76,wrong-but-confident
```

Caveat: the mutation oracle is a deliberately high bar, and M3 passed S4 on the other two of its three runs, so this reads as a **consistency/variance issue under a strict bar**, not "M3 can't write quality tests." Which mutant survived, and which assertion was missing, isn't recoverable from the journal (the harness redacts tool arguments).
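To make the S4 bar concrete, here is a minimal sketch of a mutation-style oracle. Everything in it is hypothetical: the probe's real `classify(n)`, its three mutants, and its pytest-based runner are not shown in this report, so the implementation, the mutants, and the two test suites below are stand-ins chosen only to illustrate how a suite can pass on the real code yet fail the oracle.

```python
# Hypothetical stand-in; the probe's real classify(n) is not shown here.
def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


# Three hypothetical mutants, each with one planted bug.
def mutant_boundary(n):  # `<` changed to `<=`
    if n <= 0:
        return "negative"
    return "positive"


def mutant_label(n):  # wrong label on the last branch
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "negative"


def mutant_zero(n):  # zero branch removed
    if n < 0:
        return "negative"
    return "positive"


MUTANTS = [mutant_boundary, mutant_label, mutant_zero]


def thorough_suite(fn):
    assert fn(-1) == "negative"
    assert fn(0) == "zero"  # boundary case: kills two of the mutants
    assert fn(1) == "positive"


def weak_suite(fn):
    assert fn(-5) == "negative"
    assert fn(5) == "positive"  # never probes n == 0


def passes(suite, fn):
    """True if every assertion in the suite holds for fn."""
    try:
        suite(fn)
        return True
    except AssertionError:
        return False


def oracle(suite, real, mutants):
    """Pass only if the suite passes on real code and fails on every mutant."""
    return passes(suite, real) and not any(passes(suite, m) for m in mutants)


print(oracle(thorough_suite, classify, MUTANTS))  # True
print(oracle(weak_suite, classify, MUTANTS))      # False
```

The weak suite passes on the real implementation, which is all a "local tests pass" self-check would see, yet two mutants survive it. That is the shape of the S4 miss, though which mutant actually survived in `S4 m3 r1` is unknown.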

## Repro 2 — `S5 m3 r3`: long repo trace (mostly an oracle artifact, low confidence)

S5 asks the model to trace, in a real repo, how a backend tool invocation becomes a tool card in the browser, and to report three facts: the backend callback, the SSE event name, and the frontend render function.

M3 got 2 of the 3 facts exactly right (SSE event `tool`, render function `appendLiveToolCard`). For the backend callback it traced the **gateway path** (`_gateway_tool_progress_event` — a real function in the repo that also emits a `tool` event ending at `appendLiveToolCard`) instead of the streaming-path callback the oracle hard-coded as the only accepted answer.

```csv
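scenario,model,run,oracle_pass,terminal_state,wall_s,tool_calls,tool_completes,tool_errors,compress_events,compressed_events,answer_after_tools,tokens_out,iter_limit_hit,n_events,failure_tag
S5,m3,r3,0,completed,170.4,29,29,0,0,0,1,4308,0,218,wrong-but-confident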
```

So this is mainly an **oracle-coverage limitation, not an M3 weakness**: M3 traced a real, valid alternative path; the oracle encoded only one. The oracle has since been widened to accept either path. **Please treat this run as low-confidence.**
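A minimal sketch of the difference between the original and the widened S5 check, under stated assumptions: the answer is modeled as a dict with hypothetical keys, and the streaming-path callback appears only as a placeholder string because its real name is not given in this report.

```python
# Placeholder only: the real streaming-path callback name is not given here.
STREAMING_CALLBACK = "<streaming_path_callback>"
GATEWAY_CALLBACK = "_gateway_tool_progress_event"  # named in the report

EXPECTED_EVENT = "tool"
EXPECTED_RENDER = "appendLiveToolCard"


def s5_oracle(answer, accepted_callbacks):
    """Check the three reported facts; answer keys are a hypothetical shape."""
    return (
        answer.get("callback") in accepted_callbacks
        and answer.get("sse_event") == EXPECTED_EVENT
        and answer.get("render_fn") == EXPECTED_RENDER
    )


STRICT = {STREAMING_CALLBACK}                     # original: one path only
WIDENED = {STREAMING_CALLBACK, GATEWAY_CALLBACK}  # widened: either path

# The three facts M3 reported in `S5 m3 r3`, as described above.
m3_r3 = {
    "callback": GATEWAY_CALLBACK,
    "sse_event": "tool",
    "render_fn": "appendLiveToolCard",
}

print(s5_oracle(m3_r3, STRICT))   # False
print(s5_oracle(m3_r3, WIDENED))  # True
```

Modeling accepted answers as a set rather than a single string keeps the oracle strict on the event name and render function while letting it recognize every valid entry point.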

## What this suggests (and what it doesn't)

M3 looked stable on bounded coding tasks, cross-file edits, read-only review, token recall, and strict single-file boundaries, with no tool failures, runaway loops, or iteration-limit hits across the run set. The one genuine miss has the `wrong-but-confident` shape: a completed, confident answer that fails a deeper check. That matters for coding agents because users are likely to trust a completed answer unless the agent runs a stronger self-check.

Suggested areas to look at:

1. **Stronger self-checking when the real acceptance bar exceeds "local tests pass"** (e.g. mutation-testing / test-quality tasks). This is the one reasonably supported signal here.
2. **The `wrong-but-confident` failure shape generally** (completed + final answer + deeper check fails), separate from tool/runtime errors.
3. **A dedicated post-compression retention eval** — this run set never triggered compression, so it says nothing about that.
4. *(Lower priority)* Long code-trace precision — but only after eval oracles accept all valid paths; this run's S5 signal was contaminated by oracle coverage, not a confirmed model issue.

## Methodology / privacy notes

- This is a small probe (n=3 per cell), too small to support statistically significant conclusions.
- Codex baseline carries oracle pass/fail only (no telemetry), so cross-model comparison is on pass rate.
- All shared data is sanitized: local absolute paths and raw journal paths are replaced with stable labels. Detailed artifacts (full results CSV, per-scenario prompts + oracles, and sanitized failure repros) are available on request.
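To give a rough sense of how little n=21 per model pins down, the sketch below computes Wilson score intervals for the two headline pass rates. This is an illustrative add-on, not part of the original analysis, and it makes a crude assumption: that all 21 runs per model are independent trials with one shared pass probability, which pooled scenarios do not really satisfy.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial pass rate (z=1.96 is about 95%)."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for label, k in (("M3", 19), ("Codex", 21)):
    lo, hi = wilson_interval(k, 21)
    print(f"{label}: {k}/21 -> [{lo:.2f}, {hi:.2f}]")
# M3: 19/21 -> [0.71, 0.97]
# Codex: 21/21 -> [0.85, 1.00]
```

The two intervals overlap substantially, which is consistent with reading the 19/21 vs 21/21 gap as a signal worth following up rather than a measured difference.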

---

<details>
<summary>中文版本</summary>

## 背景

我对 MiniMax M3 做了一轮小规模本地 agentic coding probe，并用 Codex 作为 baseline。这不是通用 benchmark，而是观察真实 coding agent 工作流：自主修改、运行测试、跨文件一致性、只读 bug review、长链路代码追踪、token retention、严格文件边界遵守。

## 测试设置

- 矩阵：7 个场景 × 2 个模型 × 3 次重复 = 42 个有效 run。
- 每个 run 使用相同 sandbox 起点、相同 prompt、相同 **model-agnostic oracle**（逐场景的客观对错判定，与模型自述无关）。
- M3 通过本地 agent harness（Hermes WebUI）运行，可记录 run-journal 遥测（工具调用、耗时、输出 token、压缩事件）。Codex 通过 `codex exec` 在相同 sandbox + 相同 oracle 下运行，但无该遥测——所以公平主轴是 **oracle 通过率**，遥测仅用于分析 M3 自身行为。
- 中断、旧 journal、workspace mismatch、人工修复的 run 均未计入。

## 结果摘要

- **M3：19/21 oracle 通过。Codex：21/21。**
- 两个未过 oracle 的 run：`S4 m3 r1` 与 `S5 m3 r3`，均为 completed、有最终回答、无工具错误、无 iteration-limit hit（机械标记 `wrong-but-confident`）。
- 复核后：**只有 `S4 m3 r1` 是 M3 自身质量层面的 miss**（且 S4 另外 2/3 次通过）；**`S5 m3 r3` 主要是 oracle 路径覆盖不足**——M3 追踪了一条真实有效的替代路径，oracle 只接受其中一条。S5 请按低置信度看待。
- 全部 run 的 `compress_events = 0`。S6 中 M3 完成了 token retention，但因为没有触发自动压缩，**不能**据此说明 M3 在压缩后仍能保持上下文。

## Repro 1 — `S4 m3 r1`：测试质量 / mutation oracle（真实 miss）

S4 要求为 `classify(n)` 写 pytest 测试。oracle 要求测试 (a) 在真实实现上通过，(b) **能杀掉三个植入的变异体**（mutation testing）。M3 写了覆盖面不差的测试（分支、边界、甚至 `bool` 是 `int` 的 edge case），但没杀掉全部变异体，故 oracle 失败。

注意：mutation oracle 是刻意拉高的标准，且 M3 在 S4 另外 2/3 次通过——所以更像严格标准下的**一致性/方差问题**，而非“不会写质量测试”。具体漏杀哪个变异体无法从 journal 还原（harness 对工具参数脱敏）。

## Repro 2 — `S5 m3 r3`：长链路 repo trace（主要是 oracle 假阴性，低置信度）

S5 要求在真实 repo 中追踪 backend tool invocation 如何变成 browser tool card，并报告三个事实：backend callback、SSE event 名、frontend render 函数。M3 答对了其中 2 个（`tool` 事件、`appendLiveToolCard`）；backend callback 给的是 **gateway 路径**的 `_gateway_tool_progress_event`（repo 中真实存在、同样会发 `tool` 事件并终结于 `appendLiveToolCard`），而非 oracle 唯一硬编码接受的 streaming 路径 callback。

所以这主要是 **oracle 路径覆盖不足，不是 M3 弱点**。该 oracle 已放宽为接受任一路径。**请将此 run 视为低置信度。**

## 希望关注的问题

1. 当真实验收标准强于“本地测试通过”时（如 mutation testing），模型能否主动识别更强标准——本轮**唯一较有支撑**的信号。
2. `wrong-but-confident` 形态（completed + 有最终答案 + 深层校验不过），区别于工具/运行时报错。
3. 单独做 post-compression retention eval（本轮未触发压缩）。
4. （较低优先级）长 repo trace 精度——但需先把 oracle 修成接受所有有效路径再评。

## 方法/隐私说明

小规模 probe（每格 n=3），非统计显著 benchmark；Codex 基线仅有 oracle pass/fail。所有共享数据已清洗（本机绝对路径与 raw journal 路径替换为稳定标签）。完整结果 CSV、逐场景 prompt+oracle、清洗后的失败 repro 可按需提供。

</details>

---

*Shared as good-faith eval feedback. AI-assisted preparation (Claude / Claude Code) for harness, analysis, and sanitization; all runs executed locally by me.*

