test(harness): integration-test parity for openai, claude-code, codex #433
Merged
danielmillerp
approved these changes
Jun 23, 2026
Add offline sync/async/temporal integration suites for the openai, claude_code and codex harnesses (+76 tests), mirroring the existing langgraph/pydantic_ai coverage. Extend the harness-integration live-matrix to all five harnesses and switch the path trigger to a test_harness_*.py glob so new suites are picked up automatically.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Seventh slice of #425. Brings the new harnesses up to integration-test parity.
- Adds `test_harness_{openai,claude_code,codex}_{sync,async,temporal}` suites (+76 tests), using fake streams / TestModel + fake streaming/tracing; no live infrastructure required.
- Extends the `harness-integration` live-matrix to all five harnesses and switches the path trigger to a `test_harness_*.py` glob so future suites are picked up automatically.

Test plan
- `pytest` the 9 new suites: 76 passed

Notes
Stacked on #432 (needs the OpenAI facade export). Retarget to `next` after the chain merges.

🤖 Generated with Claude Code
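The suite naming scheme and the glob trigger described in the summary can be sketched as follows. This is only an illustration: the harness and channel names come from the PR text, while `suite_filenames` is a hypothetical helper, and Python's `fnmatch` is used as an approximation of the CI path filter, whose exact glob semantics may differ.

```python
from fnmatch import fnmatch
from itertools import product

# Names taken from the PR summary; the helper below is hypothetical.
HARNESSES = ("openai", "claude_code", "codex")
CHANNELS = ("sync", "async", "temporal")
PATH_GLOB = "test_harness_*.py"


def suite_filenames(harnesses=HARNESSES, channels=CHANNELS):
    """Build one suite filename per harness/channel pairing."""
    return [f"test_harness_{h}_{c}.py" for h, c in product(harnesses, channels)]


if __name__ == "__main__":
    suites = suite_filenames()
    matched = [name for name in suites if fnmatch(name, PATH_GLOB)]
    # 3 harnesses x 3 channels, all of which the glob should pick up.
    print(len(suites), len(matched))
```

Under the same approximation, a later file such as `test_harness_newthing_sync.py` would also match the glob without any workflow change, which is the point of switching away from an explicit file list.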
Greptile Summary
This PR brings the new OpenAI, claude-code, and Codex harnesses up to full integration-test parity by adding 76 offline tests across nine new suites and updating the CI matrix to cover all five harnesses automatically.
- The nine new suites (`test_harness_{openai,claude_code,codex}_{sync,async,temporal}.py`) exercise `UnifiedEmitter.yield_turn` and `auto_send_turn` with hand-built canonical event fixtures and fake streaming/tracing backends; no live infrastructure required.
- The CI workflow switches the path trigger to `test_harness_*.py` and adds `openai`, `claude_code`, and `codex` to the live-matrix (3 harnesses × 3 channels = 9 new matrix jobs).

Confidence Score: 5/5
All changes are additive test code and a CI workflow update; no production logic is touched.
Every new file is a pure test addition using offline fakes with no live infrastructure. The CI change is a straightforward matrix expansion plus a glob-broadening that correctly picks up any future test_harness_*.py suites. The tests follow the same patterns already established by the pydantic_ai and langgraph suites.
No files require special attention.
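As a rough illustration of the "offline fakes" pattern mentioned above, the sketch below drives a hand-built event fixture through a recording fake instead of a live streaming backend. All names here (`FakeStreaming`, `fixture_events`, `send_turn`) and the event shapes are assumptions made for the example; they are not the actual `UnifiedEmitter` API or the fixtures used in this PR.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeStreaming:
    """Hypothetical stand-in for a streaming backend: records instead of sending."""

    sent: list = field(default_factory=list)

    async def send(self, event: dict) -> None:
        self.sent.append(event)


async def fixture_events():
    """Hand-built event fixture; the event shapes are illustrative only."""
    yield {"type": "text_delta", "text": "Hel"}
    yield {"type": "text_delta", "text": "lo"}
    yield {"type": "done"}


async def send_turn(events, streaming) -> str:
    """Hypothetical emitter: forwards every event and returns the joined text."""
    parts = []
    async for event in events:
        await streaming.send(event)
        if event["type"] == "text_delta":
            parts.append(event["text"])
    return "".join(parts)


def run_offline_turn():
    """Run one turn end to end with no network or credentials involved."""
    streaming = FakeStreaming()
    text = asyncio.run(send_turn(fixture_events(), streaming))
    return text, streaming.sent
```

A test built this way can assert on both the returned text and the recorded events, which is roughly what "no live infrastructure required" means in practice.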
Important Files Changed
- … `test_harness_*.py` and extends the live-matrix with `openai`, `claude_code`, and `codex`; clean and intentional.
- … `test_reasoning_span_opened_then_done_closed` that explicitly inspects OpenSpan/CloseSpan signals and `is_complete`.

Flowchart
    %%{init: {'theme': 'neutral'}}%%
    flowchart TD
        subgraph Harnesses["Harnesses under test"]
            OAI[OpenAI]
            CC[Claude Code]
            CDX[Codex]
        end
        subgraph Channels["Channels × 3"]
            SYNC[sync\nyield_turn]
            ASYNC[async\nauto_send_turn]
            TEMP[temporal\nauto_send_turn\n+ created_at]
        end
        subgraph Fakes["Offline fakes"]
            FS[_FakeStreaming]
            FT[FakeTracing]
            FIX[Event fixtures\nhand-built]
        end
        OAI --> SYNC
        OAI --> ASYNC
        OAI --> TEMP
        CC --> SYNC
        CC --> ASYNC
        CC --> TEMP
        CDX --> SYNC
        CDX --> ASYNC
        CDX --> TEMP
        SYNC -->|events| FIX
        ASYNC -->|events| FIX
        TEMP -->|events + created_at| FIX
        ASYNC --> FS
        TEMP --> FS
        SYNC --> FT
        ASYNC --> FT

Comments Outside Diff (3)
General comment
… 1 failed, 38 passed, 7 errors, rather than approximately 76 passing tests. The blocking failures are in changed tests and their imported changed harness code:
- `tests/lib/core/harness/test_harness_openai_sync.py` fails collection due to missing `scale_gp`;
- all three Claude Code suites fail collection at `src/agentex/lib/adk/_modules/_claude_code_turn.py:159`, where the assertion that `ClaudeCodeTurn` satisfies `HarnessTurn` fails;
- `tests/lib/core/harness/test_harness_openai_async.py:251` fails importing `agentex.lib.adk._modules._openai_turn` via the package attribute path.
This is a contract mismatch with the PR validation objective that the new suites run offline and pass.
- … `ClaudeCodeTurn`'s runtime protocol assertion does not hold in the executed environment. The OpenAI async test also relies on importing `agentex.lib.adk` as an attribute of `agentex.lib`, which failed after module loading in the test run.
- … `ClaudeCodeTurn` so its `events` property and `usage()` method satisfy `HarnessTurn` at runtime; and adjust the OpenAI async test/import path so `agentex.lib.adk._modules._openai_turn` is importable consistently. Re-run the exact nine paths until they collect the expected test count and pass without live credentials/services.

General comment
`pytest --collect-only` aborts while importing several changed test files because they do `from typing import ... override`. Python 3.11 does not expose `typing.override`, so openai sync, openai async via imported source, claude_code sync, and codex sync collect 0 tests. This prevents the nine-suite parity claim from being true in the repository's declared Python 3.11 environment.
- … `override` from `typing`, which is only available in Python 3.12 and later; this repository declares Python 3.11 support and the validation environment is Python 3.11.6.
- … `override` from `typing_extensions` instead of `typing` in the changed tests and any imported source modules that must run under Python 3.11.

General comment
`test_harness_claude_code_async.py` and `test_harness_claude_code_temporal.py` fail during collection because importing `agentex.lib.adk._modules._claude_code_turn` raises `AssertionError: ClaudeCodeTurn must satisfy the HarnessTurn protocol`. As a result, both changed pairings collect 0 tests.
- … `_claude_code_turn.py` fails before pytest can collect the new claude_code async/temporal tests.
- … `ClaudeCodeTurn` to satisfy the `HarnessTurn` runtime protocol or remove/defer the import-time assertion so test modules can be imported and collected.

Reviews (11): Last reviewed commit: "fix(test): update codex reasoning span e..."
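The two fixes suggested in the comments above (a version-tolerant `override` import, and a protocol check that does not run at import time) might look roughly like the sketch below. `HarnessTurnSketch`, `BaseTurn`, and `ClaudeCodeTurnSketch` are hypothetical stand-ins: the real `HarnessTurn` and `ClaudeCodeTurn` definitions are not shown in this PR page, so the member shapes (`events` as a property, `usage()` as a method) are inferred from the review text only. The final no-op fallback for `override` is also an assumption added so the sketch runs without `typing_extensions` installed.

```python
from typing import Protocol, runtime_checkable

try:  # typing.override exists from Python 3.12 onwards
    from typing import override
except ImportError:
    try:
        from typing_extensions import override
    except ImportError:
        # Assumption for this sketch only: degrade to a no-op decorator.
        def override(func):
            return func


@runtime_checkable
class HarnessTurnSketch(Protocol):
    """Hypothetical protocol: an `events` property plus a `usage()` method."""

    @property
    def events(self) -> list: ...

    def usage(self) -> dict: ...


class BaseTurn:
    """Hypothetical base class, present so `@override` has something to override."""

    def usage(self) -> dict:
        return {}


class ClaudeCodeTurnSketch(BaseTurn):
    def __init__(self, events):
        self._events = list(events)

    @property
    def events(self) -> list:
        return self._events

    @override
    def usage(self) -> dict:
        return {"events": len(self._events)}


def satisfies_protocol(turn) -> bool:
    """Deferred structural check on an instance, meant to be called from a test."""
    return isinstance(turn, HarnessTurnSketch)
```

Running the check on an instance inside a test, instead of asserting at module import, means a mismatch shows up as one failing test rather than as a collection error that hides every test in the file.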