Skip to content

fix(harness): correct codex & openai reasoning stream envelopes#441

Merged
declan-scale merged 3 commits into
nextfrom
declan-scale/fix-codex-openai-reasoning-stream
Jun 23, 2026
Merged

fix(harness): correct codex & openai reasoning stream envelopes#441
declan-scale merged 3 commits into
nextfrom
declan-scale/fix-codex-openai-reasoning-stream

Conversation

@declan-scale

@declan-scale declan-scale commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

What

Fixes two reasoning-stream issues found while reviewing the unified-harness release (#424). Both are confined to the streaming envelope (the live StreamTaskMessage* sequence); neither changes the persisted message type, but #1 produces a duplicate persisted message.

1. Codex: duplicate / orphaned reasoning message (🔴 user-visible)

_codex_sync.py emitted Start(ReasoningContent, active) then Full(ReasoningContent, static) at the open index, with no Done. auto_send handles Full by opening its own throwaway streaming context (it ignores event.index), so the original Start context was never closed until end-of-turn teardown — persisting a second, near-empty reasoning message with a later timestamp (duplicate + out-of-order).

Fix: mirror the working _claude_code_sync.py pattern — stream the final reasoning as ReasoningSummaryDelta + ReasoningContentDelta on the open index, then close with Done. The open context accumulates the final ReasoningContent and closes cleanly as one message. The no-started case opens the Start lazily and closes it the same way; the empty-reasoning case still closes with a bare Done.

2. OpenAI: reasoning Start content type regressed ReasoningContentTextContent (🟠)

convert_openai_to_agentex_events opened reasoning summary/content messages with a TextContent Start. On the migrated auto_send/Temporal path this regressed the pre-migration behavior (which started reasoning with ReasoningContent), so any consumer that branches on the start event's content type to show a "thinking" indicator now sees plain text. The persisted content is rebuilt from the reasoning deltas regardless, so only the live envelope was affected.

Fix: emit ReasoningContent(style="active") for both reasoning Starts. This aligns the converter with the codex/claude_code taps, the (already-corrected) langgraph-sync converter, and what the OpenAI conformance suite already treats as canonical.

Testing

  • Rewrote the codex reasoning tests to assert the Start + summary/content deltas + Done sequence and that no Full(ReasoningContent) is emitted; added an empty-block-closes-with-Done case.
  • tests/lib/adk/test_codex_sync.py, tests/lib/adk/providers/test_openai_turn.py, tests/lib/adk/test_codex_turn.py, tests/lib/adk/test_claude_code_sync.py, tests/lib/adk/providers/, and the full tests/lib/core/harness/ (incl. conformance) suites pass (203 + 87 green across runs).
  • ruff format + ruff check clean on all changed files.

🤖 Generated with Claude Code

Greptile Summary

This PR fixes two streaming envelope bugs in the reasoning message pipeline: a Codex duplicate-message issue caused by mixing Start+Full (which leaves the open context dangling via auto_send's index-ignoring behaviour), and an OpenAI regression where reasoning Start events were typed as TextContent instead of ReasoningContent.

  • Codex (_codex_sync.py): item.completed now emits ReasoningSummaryDelta + ReasoningContentDelta + Done on the already-open index, mirroring the _claude_code_sync.py pattern; a lazy Start is opened when no item.started preceded it, and an empty block still closes cleanly with a bare Done.
  • OpenAI (sync_provider.py): Both reasoning_summary and reasoning_content Start events now carry ReasoningContent(style="active") so downstream consumers that branch on the start-event type correctly render a thinking indicator.
  • Tests (test_codex_sync.py): Existing reasoning tests are rewritten to assert the full Start → deltas → Done sequence and explicitly assert no Full(ReasoningContent) is emitted; a new test covers the empty-block case.

Confidence Score: 4/5

Safe to merge — both fixes are well-scoped to the streaming envelope layer and don't touch persisted message types; the production logic is correct.

The Codex and OpenAI changes are straightforward and well-explained. The lazy-open path in _codex_sync.py correctly opens a Start, emits both a ReasoningSummaryDelta and a ReasoningContentDelta, then closes with Done — but the corresponding test (test_reasoning_no_started_opens_and_closes_one_message) only asserts on the content delta, leaving the summary delta unverified on that path.

tests/lib/adk/test_codex_sync.py — the lazy-open reasoning test is missing the ReasoningSummaryDelta assertion that the normal-path test carries.

Important Files Changed

Filename Overview
src/agentex/lib/adk/_modules/_codex_sync.py Replaces Full(ReasoningContent) emission with Start+SummaryDelta+ContentDelta+Done sequence; adds lazy Start for no-started case and imports ReasoningSummaryDelta.
src/agentex/lib/adk/providers/_modules/sync_provider.py Fixes reasoning Start events to use ReasoningContent instead of TextContent for both reasoning_summary and reasoning_content paths; imports ReasoningContent.
tests/lib/adk/test_codex_sync.py Rewrites reasoning tests to assert Start+deltas+Done pattern and adds empty-block test; lazy-open path missing ReasoningSummaryDelta assertion.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Codex as Codex API
    participant Proc as _CodexStreamProcessor
    participant Consumer as auto_send / Consumer

    Note over Proc,Consumer: BEFORE (broken)
    Codex->>Proc: item.started (reasoning)
    Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, active) @ idx"
    Codex->>Proc: "item.completed (reasoning, text=...)"
    Proc->>Consumer: "StreamTaskMessageFull(ReasoningContent, static) @ idx"
    Note over Consumer: auto_send ignores idx on Full — Start @ idx stays dangling until teardown — duplicate message persisted

    Note over Proc,Consumer: AFTER (fixed)
    Codex->>Proc: item.started (reasoning)
    Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, active) @ idx"
    Codex->>Proc: "item.completed (reasoning, text=...)"
    Proc->>Consumer: "StreamTaskMessageDelta(ReasoningSummaryDelta) @ idx"
    Proc->>Consumer: "StreamTaskMessageDelta(ReasoningContentDelta) @ idx"
    Proc->>Consumer: "StreamTaskMessageDone @ idx"
    Note over Consumer: Open context accumulates deltas and closes cleanly — one message persisted

    Note over Proc,Consumer: OpenAI fix (sync_provider.py)
    Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, was TextContent) @ idx"
    Note over Consumer: Consumers see correct type for thinking-indicator branch
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Codex as Codex API
    participant Proc as _CodexStreamProcessor
    participant Consumer as auto_send / Consumer

    Note over Proc,Consumer: BEFORE (broken)
    Codex->>Proc: item.started (reasoning)
    Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, active) @ idx"
    Codex->>Proc: "item.completed (reasoning, text=...)"
    Proc->>Consumer: "StreamTaskMessageFull(ReasoningContent, static) @ idx"
    Note over Consumer: auto_send ignores idx on Full — Start @ idx stays dangling until teardown — duplicate message persisted

    Note over Proc,Consumer: AFTER (fixed)
    Codex->>Proc: item.started (reasoning)
    Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, active) @ idx"
    Codex->>Proc: "item.completed (reasoning, text=...)"
    Proc->>Consumer: "StreamTaskMessageDelta(ReasoningSummaryDelta) @ idx"
    Proc->>Consumer: "StreamTaskMessageDelta(ReasoningContentDelta) @ idx"
    Proc->>Consumer: "StreamTaskMessageDone @ idx"
    Note over Consumer: Open context accumulates deltas and closes cleanly — one message persisted

    Note over Proc,Consumer: OpenAI fix (sync_provider.py)
    Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, was TextContent) @ idx"
    Note over Consumer: Consumers see correct type for thinking-indicator branch
Loading

Fix All in Claude Code

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
tests/lib/adk/test_codex_sync.py:441-449
**Lazy-open path missing `ReasoningSummaryDelta` assertion**

The production code for the `idx is None` case always emits a `ReasoningSummaryDelta` before the `ReasoningContentDelta` (lines 407–416 of `_codex_sync.py`). This test only asserts on `content_deltas` and ignores `summary_deltas`, so a future regression that accidentally drops the summary delta on this path would go undetected. Adding the same `summary_deltas` assertions that `test_reasoning_start_deltas_done` carries would close the gap.

Reviews (1): Last reviewed commit: "fix(tests): narrow reasoning delta types..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

declan-scale and others added 2 commits June 23, 2026 17:45
… duplicate message

Codex reasoning emitted Start(active) then Full(static) at the open index with
no Done. auto_send routes a Full into its own throwaway streaming context
(ignoring the index), so the Start context survived until end-of-turn teardown
and persisted a second, near-empty reasoning message (user-visible duplicate +
out-of-order).

Mirror the claude_code pattern: stream the final reasoning as summary + content
deltas on the open index, then close with a Done, so the open context
accumulates the final ReasoningContent and closes cleanly as one message. The
no-started case opens the Start lazily and closes it the same way.

Updated tests assert the Start + deltas + Done sequence and that no
Full(ReasoningContent) is emitted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
convert_openai_to_agentex_events opened reasoning summary/content messages with
a TextContent Start. On the migrated auto_send/Temporal path this regressed the
prior behavior (which started reasoning with ReasoningContent), so consumers
branching on the start event's content type render reasoning as plain text. The
final persisted content is rebuilt from the reasoning deltas regardless, so this
only affects the live stream envelope. Aligns with the codex/claude_code taps
and the langgraph-sync converter, and matches what the openai conformance suite
already treats as canonical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng tests

pyright does not narrow the StreamTaskMessageDelta.delta union through an
isinstance filter inside a list comprehension, so accessing content_delta /
summary_delta on the collected elements failed. Bind each delta to a local and
assert isinstance before accessing the typed attribute.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +441 to +449
assert len(starts) == 1
assert isinstance(starts[0].content, ReasoningContent)
assert reasoning_fulls == []
assert len(content_deltas) == 1
content_delta = content_deltas[0].delta
assert isinstance(content_delta, ReasoningContentDelta)
assert content_delta.content_delta == "orphan thought"
assert len(dones) == 1
assert dones[0].index == starts[0].index

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Lazy-open path missing ReasoningSummaryDelta assertion

The production code for the idx is None case always emits a ReasoningSummaryDelta before the ReasoningContentDelta (lines 407–416 of _codex_sync.py). This test only asserts on content_deltas and ignores summary_deltas, so a future regression that accidentally drops the summary delta on this path would go undetected. Adding the same summary_deltas assertions that test_reasoning_start_deltas_done carries would close the gap.

Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/lib/adk/test_codex_sync.py
Line: 441-449

Comment:
**Lazy-open path missing `ReasoningSummaryDelta` assertion**

The production code for the `idx is None` case always emits a `ReasoningSummaryDelta` before the `ReasoningContentDelta` (lines 407–416 of `_codex_sync.py`). This test only asserts on `content_deltas` and ignores `summary_deltas`, so a future regression that accidentally drops the summary delta on this path would go undetected. Adding the same `summary_deltas` assertions that `test_reasoning_start_deltas_done` carries would close the gap.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Claude Code

@declan-scale declan-scale merged commit 1d86e8a into next Jun 23, 2026
55 checks passed
@declan-scale declan-scale deleted the declan-scale/fix-codex-openai-reasoning-stream branch June 23, 2026 22:01
@stainless-app stainless-app Bot mentioned this pull request Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants