Small, self-critical agentic-coding probe of MiniMax M3 vs Codex (baseline), shared as feedback. Reproducible harness; failures fact-checked against run journals; one of the two non-passing runs is flagged as an eval-side artifact rather than a model issue.
Context
I ran a small local agentic coding evaluation for MiniMax M3, using Codex as a baseline. The goal was not a broad benchmark, but to probe realistic coding-agent behavior: autonomous edits, test execution, cross-file coherence, read-only bug review, long repository tracing, token retention, and strict file-boundary following.
Evaluation setup
- Matrix: 7 scenarios × 2 models × 3 repeats = 42 valid runs.
- Each run used the same sandbox reset point, the same prompt, and the same model-agnostic oracle (a per-scenario pass/fail check, independent of what the model claims).
- M3 ran through a local agent harness (Hermes WebUI) that records run-journal telemetry (tool calls, wall time, output tokens, compression events). Codex ran through `codex exec` against the same sandbox and the same oracle, but without that telemetry, so the fair head-to-head axis is oracle pass/fail, and telemetry is used only to analyze M3's behavior.
- Invalid runs (interrupted, stale journal, workspace mismatch, manually repaired) were excluded.
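As a rough illustration of the matrix and the exclusion rule above, the sketch below enumerates the planned cells and filters out invalid runs. The `invalid_reason` field and its label strings are assumptions made for this example, not the harness's actual schema.

```python
from itertools import product

SCENARIOS = [f"S{i}" for i in range(1, 8)]  # S1..S7
MODELS = ["m3", "codex"]
REPEATS = ["r1", "r2", "r3"]

# Exclusion reasons from the list above; the label strings are hypothetical.
INVALID_REASONS = {
    "interrupted",
    "stale_journal",
    "workspace_mismatch",
    "manually_repaired",
}


def planned_runs():
    """Enumerate every (scenario, model, repeat) cell of the matrix."""
    return list(product(SCENARIOS, MODELS, REPEATS))


def is_valid(record):
    """A run counts only if it carries none of the exclusion reasons."""
    return record.get("invalid_reason") not in INVALID_REASONS


runs = planned_runs()
print(len(runs))  # 7 * 2 * 3 = 42
```

An excluded run would be re-executed from the same sandbox reset point, so the matrix still ends with 42 valid cells.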
Result summary
- M3: 19/21 oracle pass. Codex: 21/21.
- Two M3 runs did not pass the oracle: `S4 m3 r1` and `S5 m3 r3`. Both completed with a final answer, no tool errors, and no iteration-limit hits (mechanically tagged `wrong-but-confident`).
- On review, only `S4 m3 r1` is a genuine model-quality miss (and M3 passed S4 on its other two runs). `S5 m3 r3` is mainly an oracle-coverage artifact: M3 traced a real, valid alternative code path, but the oracle accepted only one path. Treat S5 as low-confidence.
- `compress_events = 0` across the entire run set. M3 passed S6 (token recall) on all three runs, but this run set does not demonstrate post-compression retention, because auto-compression never triggered.
Per-scenario oracle pass (M3 vs Codex)
| Scenario | Dimension | M3 | Codex |
| --- | --- | --- | --- |
| S1 | single-file implement | 3/3 | 3/3 |
| S2 | cross-file feature | 3/3 | 3/3 |
| S3 | read-only bug review | 3/3 | 3/3 |
| S4 | test quality (mutation) | 2/3 | 3/3 |
| S5 | long repo trace | 2/3* | 3/3 |
| S6 | long-context token recall | 3/3 | 3/3 |
| S7 | strict file boundary | 3/3 | 3/3 |

* S5's single failure is largely an oracle-coverage artifact (see Repro 2), not a confirmed model weakness.
M3 telemetry (21 runs): avg wall 81.5s, avg tool calls 10.4, avg output tokens ~2656, tool errors 0, iteration-limit hits 0, compression events 0.
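For anyone re-deriving these numbers from the results CSV, a minimal aggregation sketch follows. It uses the column header shown in the repro sections and only the two failing rows quoted in this report as sample input, so its output covers those two rows, not the full 42-run set. The function names are illustrative.

```python
import csv
import io
from collections import defaultdict

HEADER = (
    "scenario,model,run,oracle_pass,terminal_state,wall_s,tool_calls,"
    "tool_completes,tool_errors,compress_events,compressed_events,"
    "answer_after_tools,tokens_out,iter_limit_hit,n_events,failure_tag"
)

# The only two rows quoted in this report; a full CSV would hold all 42 runs.
SAMPLE = "\n".join([
    HEADER,
    "S4,m3,r1,0,completed,67.9,6,6,0,0,0,1,2252,0,76,wrong-but-confident",
    "S5,m3,r3,0,completed,170.4,29,29,0,0,0,1,4308,0,218,wrong-but-confident",
])


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def pass_counts(rows):
    """Return {(scenario, model): (passes, total)}."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        cell = counts[(row["scenario"], row["model"])]
        cell[0] += int(row["oracle_pass"])
        cell[1] += 1
    return {key: tuple(value) for key, value in counts.items()}


def telemetry_mean(rows, column, model="m3"):
    """Average a telemetry column for one model, skipping empty cells."""
    values = [
        float(r[column])
        for r in rows
        if r["model"] == model and r[column] != ""
    ]
    return sum(values) / len(values) if values else None


rows = read_rows(SAMPLE)
print(pass_counts(rows))
print(telemetry_mean(rows, "tool_calls"))
```

Empty cells are skipped because Codex rows carry no telemetry; `telemetry_mean` returns `None` for a model with no recorded values rather than reporting a misleading zero.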
Repro 1 — S4 m3 r1: test quality / mutation oracle (genuine miss)
S4 asks the model to write pytest tests for a classify(n) function. The oracle requires the generated tests to (a) pass on the real implementation and (b) fail on three planted mutants (mutation testing).
M3 produced a broadly thorough test (branch/boundary cases, even the bool-is-int edge case) but did not kill all three mutants, so the oracle failed.
scenario,model,run,oracle_pass,terminal_state,wall_s,tool_calls,tool_completes,tool_errors,compress_events,compressed_events,answer_after_tools,tokens_out,iter_limit_hit,n_events,failure_tag
S4,m3,r1,0,completed,67.9,6,6,0,0,0,1,2252,0,76,wrong-but-confident
Caveat: the mutation oracle is a deliberately high bar, and M3 passed S4 on its other two runs, so this reads as a consistency/variance issue under a strict bar, not "M3 can't write quality tests." The exact missed assertion isn't recoverable (the harness redacts tool arguments).
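To make the oracle's two conditions concrete, here is a self-contained sketch of a mutation check. Everything in it is hypothetical: the `classify` body, the three mutants, and both test suites are invented for illustration and do not reproduce the real scenario or M3's actual tests. It only shows how a suite that looks thorough can still let one mutant survive.

```python
def _require_int(n):
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError("n must be an int")


def classify(n):
    """Hypothetical stand-in for the real function under test."""
    _require_int(n)
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "small" if n < 10 else "large"


def mutant_zero(n):  # planted bug: 0 is treated as negative
    _require_int(n)
    if n <= 0:
        return "negative"
    return "small" if n < 10 else "large"


def mutant_boundary(n):  # planted bug: off-by-one at the small/large boundary
    _require_int(n)
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "small" if n <= 10 else "large"


def mutant_bool(n):  # planted bug: the type guard is dropped, so bools pass
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "small" if n < 10 else "large"


def weak_suite(f):
    """Covers every branch and the bool edge case, but not the boundary at 10."""
    assert f(-1) == "negative"
    assert f(0) == "zero"
    assert f(5) == "small"
    assert f(50) == "large"
    try:
        f(True)
    except TypeError:
        pass
    else:
        raise AssertionError("bool was accepted")


def strong_suite(f):
    """Adds the boundary assertions that the weak suite lacks."""
    weak_suite(f)
    assert f(9) == "small"
    assert f(10) == "large"


def passes(suite, impl):
    """True if the suite runs against impl without any exception."""
    try:
        suite(impl)
    except Exception:
        return False
    return True


def mutation_oracle(suite, real, mutants):
    """Pass only if the suite passes on real and fails on every mutant."""
    if not passes(suite, real):
        return False, ["<fails on real implementation>"]
    survivors = [m.__name__ for m in mutants if passes(suite, m)]
    return not survivors, survivors


MUTANTS = [mutant_zero, mutant_boundary, mutant_bool]
print(mutation_oracle(weak_suite, classify, MUTANTS))
print(mutation_oracle(strong_suite, classify, MUTANTS))
```

In this toy setup the weak suite passes on the real implementation and kills two mutants, yet the oracle still fails because the boundary mutant survives. That is the same shape as the S4 miss: local tests are green, but the deeper check is not satisfied.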
Repro 2 — S5 m3 r3: long repo trace (mostly an oracle artifact, low confidence)
S5 asks the model to trace, in a real repo, how a backend tool invocation becomes a tool card in the browser, and to report three facts: the backend callback, the SSE event name, and the frontend render function.
M3 got 2 of the 3 facts exactly right (SSE event `tool`, render function `appendLiveToolCard`). For the backend callback it traced the gateway path (`_gateway_tool_progress_event`, a real function in the repo that also emits a `tool` event ending at `appendLiveToolCard`) instead of the streaming-path callback that the oracle hard-coded as the only accepted answer.
S5,m3,r3,0,completed,170.4,29,29,0,0,0,1,4308,0,218,wrong-but-confident
So this is mainly an oracle-coverage limitation, not an M3 weakness: M3 traced a real, valid alternative path; the oracle encoded only one. The oracle has since been widened to accept either path. Please treat this run as low-confidence.
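The widening can be pictured as moving from a single accepted callback to a set. The sketch below is an assumption-laden illustration, not the actual oracle: the structured `answer` dict is an invented shape, and `hypothetical_streaming_callback` is a placeholder because the streaming-path callback name is not given in this report.

```python
# Gateway-path callback named in this report.
GATEWAY_CALLBACK = "_gateway_tool_progress_event"
# Hypothetical placeholder: the streaming-path callback name is not given here.
STREAMING_CALLBACK = "hypothetical_streaming_callback"

NARROW = {STREAMING_CALLBACK}                     # original oracle: one path
WIDENED = {STREAMING_CALLBACK, GATEWAY_CALLBACK}  # widened oracle: either path


def trace_oracle(answer, accepted_callbacks):
    """Check the three reported facts; `answer` is an assumed dict shape."""
    return {
        "backend_callback": answer.get("backend_callback") in accepted_callbacks,
        "sse_event": answer.get("sse_event") == "tool",
        "render_fn": answer.get("render_fn") == "appendLiveToolCard",
    }


def oracle_pass(answer, accepted_callbacks):
    """The run passes only if all three facts are accepted."""
    return all(trace_oracle(answer, accepted_callbacks).values())


gateway_answer = {
    "backend_callback": GATEWAY_CALLBACK,
    "sse_event": "tool",
    "render_fn": "appendLiveToolCard",
}
print(oracle_pass(gateway_answer, NARROW))   # False
print(oracle_pass(gateway_answer, WIDENED))  # True
```

Returning per-fact results from `trace_oracle` also makes it easier to see which of the three facts a failing run missed, instead of a single opaque pass/fail bit.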
What this suggests (and what it doesn't)
M3 looked stable on bounded coding tasks, cross-file edits, read-only review, token recall, and strict single-file boundaries — with no tool failures, runaway loops, or iteration-limit hits across the run set. The one genuine miss is the wrong-but-confident shape: a completed, confident answer that fails a deeper check. That matters for coding agents because users may trust a completed answer absent a stronger self-check.
Suggested areas to look at:
- Stronger self-checking when the real acceptance bar exceeds "local tests pass" (e.g. mutation-testing / test-quality tasks). This is the one reasonably supported signal here.
- The `wrong-but-confident` failure shape generally (completed + final answer + deeper check fails), separate from tool/runtime errors.
- A dedicated post-compression retention eval — this run set never triggered compression, so it says nothing about that.
- (Lower priority) Long code-trace precision — but only after eval oracles accept all valid paths; this run's S5 signal was contaminated by oracle coverage, not a confirmed model issue.
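The `wrong-but-confident` shape can be separated mechanically from tool and runtime failures. Here is a minimal sketch of such a tagging rule over the CSV columns shown in the repro sections. Reading `answer_after_tools` as "a final answer followed the tool calls" is an assumption, and every label other than `wrong-but-confident` is hypothetical.

```python
def failure_tag(row):
    """Classify one run record; returns "" for an oracle pass."""
    if int(row["oracle_pass"]) == 1:
        return ""
    if int(row["tool_errors"]) > 0:
        return "tool-error"        # hypothetical label
    if int(row["iter_limit_hit"]) == 1:
        return "iteration-limit"   # hypothetical label
    completed = row["terminal_state"] == "completed"
    answered = int(row["answer_after_tools"]) == 1  # assumed meaning
    if completed and answered:
        return "wrong-but-confident"
    return "other"                 # hypothetical label


# Field values copied from the two failing rows quoted in this report.
s4_r1 = {
    "oracle_pass": "0", "terminal_state": "completed", "tool_errors": "0",
    "answer_after_tools": "1", "iter_limit_hit": "0",
}
s5_r3 = dict(s4_r1)  # same values for these five fields

print(failure_tag(s4_r1))  # wrong-but-confident
print(failure_tag(s5_r3))  # wrong-but-confident
```

The tag is deliberately mechanical: it says nothing about whether the oracle itself was right, which is why the S5 run carries the same tag as the S4 run despite being mostly an oracle artifact.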
Methodology / privacy notes
- This is a small probe (n=3 per cell), not a statistically significant benchmark.
- Codex baseline carries oracle pass/fail only (no telemetry), so cross-model comparison is on pass rate.
- All shared data is sanitized: local absolute paths and raw journal paths are replaced with stable labels. Detailed artifacts (full results CSV, per-scenario prompts + oracles, and sanitized failure repros) are available on request.
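Chinese version (English translation)
Background
I ran a small local agentic coding probe of MiniMax M3, with Codex as the baseline. It is not a general-purpose benchmark; it observes real coding-agent workflows: autonomous edits, running tests, cross-file consistency, read-only bug review, long-chain code tracing, token retention, and strict file-boundary compliance.
Test setup
- Matrix: 7 scenarios × 2 models × 3 repeats = 42 valid runs.
- Every run used the same sandbox starting point, the same prompt, and the same model-agnostic oracle (an objective per-scenario pass/fail judgment, independent of the model's own claims).
- M3 ran through a local agent harness (Hermes WebUI) that records run-journal telemetry (tool calls, elapsed time, output tokens, compression events). Codex ran through `codex exec` against the same sandbox and the same oracle, but without that telemetry, so the fair primary axis is the oracle pass rate, and telemetry is used only to analyze M3's own behavior.
- Interrupted runs, stale-journal runs, workspace-mismatch runs, and manually repaired runs were not counted.
Result summary
- M3: 19/21 oracle passes. Codex: 21/21.
- Two runs did not pass the oracle: `S4 m3 r1` and `S5 m3 r3`. Both completed, had a final answer, no tool errors, and no iteration-limit hit (mechanically tagged `wrong-but-confident`).
- After review, only `S4 m3 r1` is a miss in M3's own quality (and S4 passed on the other two runs). `S5 m3 r3` is mainly a case of insufficient oracle path coverage: M3 traced a real, valid alternative path, and the oracle accepted only one. Please treat S5 as low-confidence.
- `compress_events = 0` for every run. M3 completed the token-retention task in S6, but because auto-compression never triggered, this cannot show that M3 keeps context after compression.
Repro 1 (S4 m3 r1): test quality / mutation oracle (genuine miss)
S4 asks for pytest tests for classify(n). The oracle requires that the tests (a) pass on the real implementation and (b) kill three planted mutants (mutation testing). M3 wrote tests with decent coverage (branches, boundaries, even the bool-is-int edge case) but did not kill all the mutants, so the oracle failed.
Note: the mutation oracle is a deliberately raised bar, and M3 passed S4 on the other two runs, so this looks more like a consistency/variance issue under a strict bar than "cannot write quality tests". Which mutant survived cannot be reconstructed from the journal (the harness redacts tool arguments).
Repro 2 (S5 m3 r3): long-chain repo trace (mainly an oracle false negative, low confidence)
S5 asks the model to trace, in a real repo, how a backend tool invocation becomes a browser tool card, and to report three facts: the backend callback, the SSE event name, and the frontend render function. M3 got two of them right (the `tool` event and `appendLiveToolCard`). For the backend callback it gave the gateway path's `_gateway_tool_progress_event` (which really exists in the repo, also emits a `tool` event, and also ends at `appendLiveToolCard`) rather than the streaming-path callback that the oracle hard-coded as the only accepted answer.
So this is mainly insufficient oracle path coverage, not an M3 weakness. The oracle has since been relaxed to accept either path. Please treat this run as low-confidence.
Issues I would like looked at
- When the real acceptance bar is stronger than "local tests pass" (e.g. mutation testing), whether the model can proactively recognize the stronger bar. This is the only reasonably supported signal in this round.
- The `wrong-but-confident` shape (completed + final answer + deeper check fails), as distinct from tool/runtime errors.
- A separate post-compression retention eval (compression never triggered in this round).
- (Lower priority) Long repo-trace precision, but only after the oracle is fixed to accept all valid paths.
Methodology / privacy notes
Small probe (n=3 per cell), not a statistically significant benchmark; the Codex baseline has oracle pass/fail only. All shared data is sanitized (local absolute paths and raw journal paths are replaced with stable labels). The full results CSV, per-scenario prompts and oracles, and sanitized failure repros are available on request.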
Shared as good-faith eval feedback. AI-assisted preparation (Claude / Claude Code) for harness, analysis, and sanitization; all runs executed locally by me.