MiniMax M3 agentic coding probe: 19/21 oracle pass vs Codex 21/21 (one genuine wrong-but-confident miss + one oracle-coverage artifact) #49

@franksong2702

Small, self-critical agentic-coding probe of MiniMax M3 vs Codex (baseline), shared as feedback. Reproducible harness; failures fact-checked against run journals; one of the two non-passing runs is flagged as an eval-side artifact rather than a model issue.

Context

I ran a small local agentic coding evaluation for MiniMax M3, using Codex as a baseline. The goal was not a broad benchmark, but to probe realistic coding-agent behavior: autonomous edits, test execution, cross-file coherence, read-only bug review, long repository tracing, token retention, and strict file-boundary following.

Evaluation setup

  • Matrix: 7 scenarios × 2 models × 3 repeats = 42 valid runs.
  • Each run used the same sandbox reset point, the same prompt, and the same model-agnostic oracle (a per-scenario pass/fail check, independent of what the model claims).
  • M3 ran through a local agent harness (Hermes WebUI) that records run-journal telemetry (tool calls, wall time, output tokens, compression events). Codex ran through codex exec against the same sandbox + same oracle, but without that telemetry — so the fair head-to-head axis is oracle pass/fail, and telemetry is used only to analyze M3's behavior.
  • Invalid runs (interrupted, stale journal, workspace mismatch, manually repaired) were excluded.
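
The setup above can be sketched as a simple loop over the matrix. This is only an illustration, not the actual harness: `reset_sandbox`, `run_agent`, and `oracle` are hypothetical callables standing in for whatever the real harness does, and the result fields are assumed.

```python
from itertools import product

SCENARIOS = [f"S{i}" for i in range(1, 8)]  # S1..S7
MODELS = ["m3", "codex"]
REPEATS = ["r1", "r2", "r3"]


def run_matrix(reset_sandbox, run_agent, oracle):
    """Run every (scenario, model, repeat) cell and record the oracle verdict.

    All three callables are hypothetical placeholders for the real harness.
    """
    results = []
    for scenario, model, run in product(SCENARIOS, MODELS, REPEATS):
        workspace = reset_sandbox(scenario)             # same reset point for every run
        answer = run_agent(model, scenario, workspace)  # same prompt for both models
        # The oracle grades the workspace and answer content against fixed
        # criteria; the agent's own claim of success is never consulted.
        passed = bool(oracle(scenario, workspace, answer))
        results.append(
            {"scenario": scenario, "model": model, "run": run, "oracle_pass": int(passed)}
        )
    return results


if __name__ == "__main__":
    demo = run_matrix(
        reset_sandbox=lambda scenario: {"scenario": scenario},
        run_agent=lambda model, scenario, ws: "done",
        oracle=lambda scenario, ws, answer: True,
    )
    print(len(demo))  # 7 * 2 * 3 = 42
```

Keeping the oracle as a separate callable is what makes pass/fail comparable across the two models even though only one of them produces telemetry.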

Result summary

  • M3: 19/21 oracle pass. Codex: 21/21.
  • Two M3 runs did not pass the oracle — S4 m3 r1 and S5 m3 r3 — both completed with a final answer, no tool errors, no iteration-limit hits (mechanically tagged wrong-but-confident).
  • On review, only S4 m3 r1 is a genuine model-quality miss (and M3 passed S4 on the other 2/3 runs). S5 m3 r3 is mainly an oracle-coverage artifact — M3 traced a real, valid alternative code path; the oracle only accepted one path. Treat S5 as low-confidence.
  • compress_events = 0 across the entire run set. S6 passed for M3 as token retention, but this run set does not prove post-compression retention, because auto-compression never triggered.
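
As a rough sketch of the mechanical tag described in the second bullet, the rule can be applied to the two journal rows quoted in the repro sections. The exact rule used by the harness is not shown in this report, so `tag_run` is an assumption, as is reading `answer_after_tools = 1` as "a final answer was produced"; the `other-failure` label is hypothetical.

```python
import csv
import io

HEADER = (
    "scenario,model,run,oracle_pass,terminal_state,wall_s,tool_calls,"
    "tool_completes,tool_errors,compress_events,compressed_events,"
    "answer_after_tools,tokens_out,iter_limit_hit,n_events,failure_tag"
)
ROWS = (
    "S4,m3,r1,0,completed,67.9,6,6,0,0,0,1,2252,0,76,wrong-but-confident\n"
    "S5,m3,r3,0,completed,170.4,29,29,0,0,0,1,4308,0,218,wrong-but-confident\n"
)


def tag_run(row):
    """Assumed tagging rule: return '' for a pass, otherwise a failure tag."""
    if row["oracle_pass"] == "1":
        return ""
    confident = (
        row["terminal_state"] == "completed"
        and row["answer_after_tools"] == "1"  # assumed to mean a final answer exists
        and row["tool_errors"] == "0"
        and row["iter_limit_hit"] == "0"
    )
    # 'other-failure' is a hypothetical label for any non-confident failure.
    return "wrong-but-confident" if confident else "other-failure"


rows = list(csv.DictReader(io.StringIO(HEADER + "\n" + ROWS)))
for row in rows:
    print(row["scenario"], row["run"], tag_run(row))
```

The tag only describes the shape of the failure. It says nothing about the cause, which is why both runs still needed manual review.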

Per-scenario oracle pass (M3 vs Codex)

Scenario  Dimension                  M3     Codex
S1        single-file implement      3/3    3/3
S2        cross-file feature         3/3    3/3
S3        read-only bug review       3/3    3/3
S4        test quality (mutation)    2/3    3/3
S5        long repo trace            2/3*   3/3
S6        long-context token recall  3/3    3/3
S7        strict file boundary       3/3    3/3

* S5's single failure is largely an oracle-coverage artifact (see Repro 2), not a confirmed model weakness.

M3 telemetry (21 runs): avg wall 81.5s, avg tool calls 10.4, avg output tokens ~2656, tool errors 0, iteration-limit hits 0, compression events 0.

Repro 1 — S4 m3 r1: test quality / mutation oracle (genuine miss)

S4 asks the model to write pytest tests for a classify(n) function. The oracle requires the generated tests to (a) pass on the real implementation and (b) fail on three planted mutants (mutation testing).
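
A minimal sketch of how such a mutation oracle could work is shown here. The real `classify(n)` and the three planted mutants are not included in this report, so everything in it is a hypothetical stand-in: the sign-based `classify`, the three mutants, and the two example suites.

```python
def classify(n):
    """Hypothetical stand-in for the real function under test."""
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


def mutant_boundary(n):  # hypothetical mutant 1: `<` changed to `<=`
    if n <= 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


def mutant_label(n):  # hypothetical mutant 2: wrong label for positives
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "negative"


def mutant_no_zero(n):  # hypothetical mutant 3: zero branch removed
    if n < 0:
        return "negative"
    return "positive"


MUTANTS = [mutant_boundary, mutant_label, mutant_no_zero]


def suite_passes(suite, impl):
    """Run a test suite (a callable that asserts) against one implementation."""
    try:
        suite(impl)
        return True
    except AssertionError:
        return False


def mutation_oracle(suite):
    """Pass only if the suite is green on the real code and red on every mutant."""
    if not suite_passes(suite, classify):
        return False
    return all(not suite_passes(suite, mutant) for mutant in MUTANTS)


def weak_suite(f):  # looks reasonable, but never probes the zero boundary
    assert f(-5) == "negative"
    assert f(7) == "positive"


def strong_suite(f):  # adds the boundary case that kills the surviving mutants
    weak_suite(f)
    assert f(0) == "zero"


print(mutation_oracle(weak_suite), mutation_oracle(strong_suite))
```

In this sketch `weak_suite` passes on the real implementation yet lets two mutants survive. That is the same shape as a suite that looks thorough locally but fails the stricter bar.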

M3 produced a broadly thorough test (branch/boundary cases, even the bool-is-int edge case) but did not kill all three mutants, so the oracle failed.

scenario,model,run,oracle_pass,terminal_state,wall_s,tool_calls,tool_completes,tool_errors,compress_events,compressed_events,answer_after_tools,tokens_out,iter_limit_hit,n_events,failure_tag
S4,m3,r1,0,completed,67.9,6,6,0,0,0,1,2252,0,76,wrong-but-confident

Caveat: the mutation oracle is a deliberately high bar, and M3 passed S4 on the other 2/3 runs — so this reads as a consistency/variance issue under a strict bar, not "M3 can't write quality tests." The exact missed assertion isn't recoverable (the harness redacts tool arguments).

Repro 2 — S5 m3 r3: long repo trace (mostly an oracle artifact, low confidence)

S5 asks the model to trace, in a real repo, how a backend tool invocation becomes a tool card in the browser, and to report three facts: the backend callback, the SSE event name, and the frontend render function.

M3 got 2 of the 3 facts exactly right: the SSE event name (tool) and the frontend render function (appendLiveToolCard). For the backend callback, it traced the gateway path (_gateway_tool_progress_event, a real function in the repo that also emits a tool event ending at appendLiveToolCard) instead of the streaming-path callback that the oracle hard-coded as the only accepted answer.

S5,m3,r3,0,completed,170.4,29,29,0,0,0,1,4308,0,218,wrong-but-confident

So this is mainly an oracle-coverage limitation, not an M3 weakness: M3 traced a real, valid alternative path; the oracle encoded only one. The oracle has since been widened to accept either path. Please treat this run as low-confidence.
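
The difference between the original and the widened oracle can be sketched roughly as below. The answer format (a dict with three keys) is an assumption, and the streaming-path callback name is not given in this report, so `STREAMING_PATH_CALLBACK` is a placeholder rather than a real identifier.

```python
# Placeholder only: the real streaming-path callback name is not in this report.
STREAMING_PATH_CALLBACK = "STREAMING_PATH_CALLBACK"

NARROW_CALLBACKS = {STREAMING_PATH_CALLBACK}
WIDENED_CALLBACKS = {STREAMING_PATH_CALLBACK, "_gateway_tool_progress_event"}

EXPECTED_EVENT = "tool"
EXPECTED_RENDER = "appendLiveToolCard"


def trace_oracle(answer, accepted_callbacks):
    """Check the three reported facts; the dict keys are an assumed answer format."""
    return (
        answer.get("backend_callback") in accepted_callbacks
        and answer.get("sse_event") == EXPECTED_EVENT
        and answer.get("render_function") == EXPECTED_RENDER
    )


# The facts M3 reported in S5 r3, per the description above.
m3_answer = {
    "backend_callback": "_gateway_tool_progress_event",
    "sse_event": "tool",
    "render_function": "appendLiveToolCard",
}

print(trace_oracle(m3_answer, NARROW_CALLBACKS))   # False: only one path accepted
print(trace_oracle(m3_answer, WIDENED_CALLBACKS))  # True: either path accepted
```

Encoding the accepted answers as a set keeps the oracle strict on the other two facts while allowing every valid trace for the first.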

What this suggests (and what it doesn't)

M3 looked stable on bounded coding tasks, cross-file edits, read-only review, token recall, and strict single-file boundaries — with no tool failures, runaway loops, or iteration-limit hits across the run set. The one genuine miss is the wrong-but-confident shape: a completed, confident answer that fails a deeper check. That matters for coding agents because users may trust a completed answer absent a stronger self-check.

Suggested areas to look at:

  1. Stronger self-checking when the real acceptance bar exceeds "local tests pass" (e.g. mutation-testing / test-quality tasks). This is the one reasonably supported signal here.
  2. The wrong-but-confident failure shape generally (completed + final answer + deeper check fails), separate from tool/runtime errors.
  3. A dedicated post-compression retention eval — this run set never triggered compression, so it says nothing about that.
  4. (Lower priority) Long code-trace precision, but only after eval oracles accept all valid paths; in this run set the S5 signal was contaminated by an oracle-coverage gap and is not a confirmed model issue.

Methodology / privacy notes

  • This is a small probe (n=3 per cell), not a statistically significant benchmark.
  • Codex baseline carries oracle pass/fail only (no telemetry), so cross-model comparison is on pass rate.
  • All shared data is sanitized: local absolute paths and raw journal paths are replaced with stable labels. Detailed artifacts (full results CSV, per-scenario prompts + oracles, and sanitized failure repros) are available on request.
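
To make the small-sample caveat concrete, here is a rough calculation of 95% Wilson score intervals for the two overall pass rates. This interval is a standard technique chosen here for illustration; it is not part of the original analysis, and it treats the 21 runs per model as independent trials, which is itself an assumption.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


m3_low, m3_high = wilson_interval(19, 21)
codex_low, codex_high = wilson_interval(21, 21)

print(f"M3    19/21: {m3_low:.3f} to {m3_high:.3f}")
print(f"Codex 21/21: {codex_low:.3f} to {codex_high:.3f}")
print("intervals overlap:", codex_low < m3_high)
```

Under these assumptions the two intervals overlap, which is consistent with treating 19/21 vs 21/21 as a directional signal rather than a significant difference.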

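Chinese version (translated)

Background

I ran a small local agentic coding probe of MiniMax M3, with Codex as the baseline. This is not a general benchmark; it observes real coding-agent workflows: autonomous edits, running tests, cross-file consistency, read-only bug review, long-chain code tracing, token retention, and strict adherence to file boundaries.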

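Evaluation setup

  • Matrix: 7 scenarios × 2 models × 3 repeats = 42 valid runs.
  • Each run used the same sandbox starting point, the same prompt, and the same model-agnostic oracle (an objective per-scenario pass/fail check, independent of the model's own claims).
  • M3 ran through a local agent harness (Hermes WebUI), which records run-journal telemetry (tool calls, wall time, output tokens, compression events). Codex ran through codex exec against the same sandbox and the same oracle but without that telemetry, so the fair primary axis is the oracle pass rate, and telemetry is used only to analyze M3's own behavior.
  • Interrupted runs, stale journals, workspace mismatches, and manually repaired runs were not counted.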

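Result summary

  • M3: 19/21 oracle pass. Codex: 21/21.
  • Two runs did not pass the oracle: S4 m3 r1 and S5 m3 r3. Both were completed, had a final answer, and had no tool errors and no iteration-limit hits (mechanically tagged wrong-but-confident).
  • After review, only S4 m3 r1 is a miss at the level of M3's own quality (and S4 passed on the other 2/3 runs). S5 m3 r3 is mainly a gap in the oracle's path coverage: M3 traced a real, valid alternative path, and the oracle accepted only one of them. Please treat S5 as low-confidence.
  • compress_events = 0 for all runs. M3 completed the token-retention task in S6, but because auto-compression never triggered, this does not show that M3 retains context after compression.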

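Repro 1: S4 m3 r1, test quality / mutation oracle (genuine miss)

S4 asks for pytest tests for classify(n). The oracle requires the tests to (a) pass on the real implementation and (b) kill three planted mutants (mutation testing). M3 wrote tests with decent coverage (branches, boundaries, even the bool-is-int edge case) but did not kill all of the mutants, so the oracle failed.

Note: the mutation oracle is a deliberately raised bar, and M3 passed S4 on the other 2/3 runs, so this looks more like a consistency/variance issue under a strict bar than "cannot write quality tests." Which mutant survived cannot be recovered from the journal (the harness redacts tool arguments).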

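Repro 2: S5 m3 r3, long-chain repo trace (mainly an oracle false negative, low confidence)

S5 asks the model to trace, in a real repo, how a backend tool invocation becomes a browser tool card, and to report three facts: the backend callback, the SSE event name, and the frontend render function. M3 got 2 of them right (the tool event and appendLiveToolCard). For the backend callback it gave the gateway path, _gateway_tool_progress_event (a function that really exists in the repo, also emits a tool event, and also ends at appendLiveToolCard), rather than the streaming-path callback that the oracle hard-coded as the only accepted answer.

So this is mainly a gap in the oracle's path coverage, not an M3 weakness. The oracle has since been relaxed to accept either path. Please treat this run as low-confidence.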

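Issues we would like looked at

  1. When the real acceptance bar is stronger than "local tests pass" (e.g. mutation testing), whether the model can proactively recognize the stronger bar. This is the only reasonably supported signal in this round.
  2. The wrong-but-confident shape (completed + final answer present + deeper check fails), as distinct from tool/runtime errors.
  3. A separate post-compression retention eval (compression never triggered in this round).
  4. (Lower priority) Long repo-trace precision, but only after the oracle is fixed to accept all valid paths.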

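Methodology / privacy notes

This is a small probe (n=3 per cell), not a statistically significant benchmark; the Codex baseline has oracle pass/fail only. All shared data has been sanitized (local absolute paths and raw journal paths replaced with stable labels). The full results CSV, per-scenario prompts and oracles, and sanitized failure repros are available on request.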


Shared as good-faith eval feedback. AI-assisted preparation (Claude / Claude Code) for harness, analysis, and sanitization; all runs executed locally by me.
