fix(harness): correct codex & openai reasoning stream envelopes#441
Merged
declan-scale merged 3 commits intoJun 23, 2026
Merged
Conversation
… duplicate message Codex reasoning emitted Start(active) then Full(static) at the open index with no Done. auto_send routes a Full into its own throwaway streaming context (ignoring the index), so the Start context survived until end-of-turn teardown and persisted a second, near-empty reasoning message (user-visible duplicate + out-of-order). Mirror the claude_code pattern: stream the final reasoning as summary + content deltas on the open index, then close with a Done, so the open context accumulates the final ReasoningContent and closes cleanly as one message. The no-started case opens the Start lazily and closes it the same way. Updated tests assert the Start + deltas + Done sequence and that no Full(ReasoningContent) is emitted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
convert_openai_to_agentex_events opened reasoning summary/content messages with a TextContent Start. On the migrated auto_send/Temporal path this regressed the prior behavior (which started reasoning with ReasoningContent), so consumers branching on the start event's content type render reasoning as plain text. The final persisted content is rebuilt from the reasoning deltas regardless, so this only affects the live stream envelope. Aligns with the codex/claude_code taps and the langgraph-sync converter, and matches what the openai conformance suite already treats as canonical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
danielmillerp
approved these changes
Jun 23, 2026
…ng tests pyright does not narrow the StreamTaskMessageDelta.delta union through an isinstance filter inside a list comprehension, so accessing content_delta / summary_delta on the collected elements failed. Bind each delta to a local and assert isinstance before accessing the typed attribute. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines
+441
to
+449
| assert len(starts) == 1 | ||
| assert isinstance(starts[0].content, ReasoningContent) | ||
| assert reasoning_fulls == [] | ||
| assert len(content_deltas) == 1 | ||
| content_delta = content_deltas[0].delta | ||
| assert isinstance(content_delta, ReasoningContentDelta) | ||
| assert content_delta.content_delta == "orphan thought" | ||
| assert len(dones) == 1 | ||
| assert dones[0].index == starts[0].index |
There was a problem hiding this comment.
Lazy-open path missing
ReasoningSummaryDelta assertion
The production code for the idx is None case always emits a ReasoningSummaryDelta before the ReasoningContentDelta (lines 407–416 of _codex_sync.py). This test only asserts on content_deltas and ignores summary_deltas, so a future regression that accidentally drops the summary delta on this path would go undetected. Adding the same summary_deltas assertions that test_reasoning_start_deltas_done carries would close the gap.
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/lib/adk/test_codex_sync.py
Line: 441-449
Comment:
**Lazy-open path missing `ReasoningSummaryDelta` assertion**
The production code for the `idx is None` case always emits a `ReasoningSummaryDelta` before the `ReasoningContentDelta` (lines 407–416 of `_codex_sync.py`). This test only asserts on `content_deltas` and ignores `summary_deltas`, so a future regression that accidentally drops the summary delta on this path would go undetected. Adding the same `summary_deltas` assertions that `test_reasoning_start_deltas_done` carries would close the gap.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes two reasoning-stream issues found while reviewing the unified-harness release (#424). Both are confined to the streaming envelope (the live
StreamTaskMessage*sequence); neither changes the persisted message type, but #1 produces a duplicate persisted message.1. Codex: duplicate / orphaned reasoning message (🔴 user-visible)
_codex_sync.pyemittedStart(ReasoningContent, active)thenFull(ReasoningContent, static)at the open index, with noDone.auto_sendhandlesFullby opening its own throwaway streaming context (it ignoresevent.index), so the originalStartcontext was never closed until end-of-turn teardown — persisting a second, near-empty reasoning message with a later timestamp (duplicate + out-of-order).Fix: mirror the working
_claude_code_sync.pypattern — stream the final reasoning asReasoningSummaryDelta+ReasoningContentDeltaon the open index, then close withDone. The open context accumulates the finalReasoningContentand closes cleanly as one message. The no-startedcase opens theStartlazily and closes it the same way; the empty-reasoning case still closes with a bareDone.2. OpenAI: reasoning
Startcontent type regressedReasoningContent→TextContent(🟠)convert_openai_to_agentex_eventsopened reasoning summary/content messages with aTextContentStart. On the migratedauto_send/Temporal path this regressed the pre-migration behavior (which started reasoning withReasoningContent), so any consumer that branches on the start event's content type to show a "thinking" indicator now sees plain text. The persisted content is rebuilt from the reasoning deltas regardless, so only the live envelope was affected.Fix: emit
ReasoningContent(style="active")for both reasoningStarts. This aligns the converter with the codex/claude_code taps, the (already-corrected) langgraph-sync converter, and what the OpenAI conformance suite already treats as canonical.Testing
Start+ summary/content deltas +Donesequence and that noFull(ReasoningContent)is emitted; added an empty-block-closes-with-Done case.tests/lib/adk/test_codex_sync.py,tests/lib/adk/providers/test_openai_turn.py,tests/lib/adk/test_codex_turn.py,tests/lib/adk/test_claude_code_sync.py,tests/lib/adk/providers/, and the fulltests/lib/core/harness/(incl. conformance) suites pass (203 + 87 green across runs).ruff format+ruff checkclean on all changed files.🤖 Generated with Claude Code
Greptile Summary
This PR fixes two streaming envelope bugs in the reasoning message pipeline: a Codex duplicate-message issue caused by mixing
Start+Full(which leaves the open context dangling viaauto_send's index-ignoring behaviour), and an OpenAI regression where reasoningStartevents were typed asTextContentinstead ofReasoningContent._codex_sync.py):item.completednow emitsReasoningSummaryDelta+ReasoningContentDelta+Doneon the already-open index, mirroring the_claude_code_sync.pypattern; a lazyStartis opened when noitem.startedpreceded it, and an empty block still closes cleanly with a bareDone.sync_provider.py): Bothreasoning_summaryandreasoning_contentStartevents now carryReasoningContent(style="active")so downstream consumers that branch on the start-event type correctly render a thinking indicator.test_codex_sync.py): Existing reasoning tests are rewritten to assert the fullStart → deltas → Donesequence and explicitly assert noFull(ReasoningContent)is emitted; a new test covers the empty-block case.Confidence Score: 4/5
Safe to merge — both fixes are well-scoped to the streaming envelope layer and don't touch persisted message types; the production logic is correct.
The Codex and OpenAI changes are straightforward and well-explained. The lazy-open path in
_codex_sync.pycorrectly opens aStart, emits both aReasoningSummaryDeltaand aReasoningContentDelta, then closes withDone— but the corresponding test (test_reasoning_no_started_opens_and_closes_one_message) only asserts on the content delta, leaving the summary delta unverified on that path.tests/lib/adk/test_codex_sync.py — the lazy-open reasoning test is missing the
ReasoningSummaryDeltaassertion that the normal-path test carries.Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant Codex as Codex API participant Proc as _CodexStreamProcessor participant Consumer as auto_send / Consumer Note over Proc,Consumer: BEFORE (broken) Codex->>Proc: item.started (reasoning) Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, active) @ idx" Codex->>Proc: "item.completed (reasoning, text=...)" Proc->>Consumer: "StreamTaskMessageFull(ReasoningContent, static) @ idx" Note over Consumer: auto_send ignores idx on Full — Start @ idx stays dangling until teardown — duplicate message persisted Note over Proc,Consumer: AFTER (fixed) Codex->>Proc: item.started (reasoning) Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, active) @ idx" Codex->>Proc: "item.completed (reasoning, text=...)" Proc->>Consumer: "StreamTaskMessageDelta(ReasoningSummaryDelta) @ idx" Proc->>Consumer: "StreamTaskMessageDelta(ReasoningContentDelta) @ idx" Proc->>Consumer: "StreamTaskMessageDone @ idx" Note over Consumer: Open context accumulates deltas and closes cleanly — one message persisted Note over Proc,Consumer: OpenAI fix (sync_provider.py) Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, was TextContent) @ idx" Note over Consumer: Consumers see correct type for thinking-indicator branch%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant Codex as Codex API participant Proc as _CodexStreamProcessor participant Consumer as auto_send / Consumer Note over Proc,Consumer: BEFORE (broken) Codex->>Proc: item.started (reasoning) Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, active) @ idx" Codex->>Proc: "item.completed (reasoning, text=...)" Proc->>Consumer: "StreamTaskMessageFull(ReasoningContent, static) @ idx" Note over Consumer: auto_send ignores idx on Full — Start @ idx stays dangling until teardown — duplicate message persisted Note over Proc,Consumer: AFTER (fixed) Codex->>Proc: item.started (reasoning) Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, active) @ idx" Codex->>Proc: "item.completed (reasoning, text=...)" Proc->>Consumer: "StreamTaskMessageDelta(ReasoningSummaryDelta) @ idx" Proc->>Consumer: "StreamTaskMessageDelta(ReasoningContentDelta) @ idx" Proc->>Consumer: "StreamTaskMessageDone @ idx" Note over Consumer: Open context accumulates deltas and closes cleanly — one message persisted Note over Proc,Consumer: OpenAI fix (sync_provider.py) Proc->>Consumer: "StreamTaskMessageStart(ReasoningContent, was TextContent) @ idx" Note over Consumer: Consumers see correct type for thinking-indicator branchPrompt To Fix All With AI
Reviews (1): Last reviewed commit: "fix(tests): narrow reasoning delta types..." | Re-trigger Greptile