test(harness): integration-test parity for openai, claude-code, codex #433

Merged
declan-scale merged 2 commits into next from declan-scale/harness-integration-tests
Jun 24, 2026

Conversation

@declan-scale (Contributor) commented Jun 23, 2026

Summary

Seventh slice of #425. Brings the new harnesses up to integration-test parity.

  • Adds offline test_harness_{openai,claude_code,codex}_{sync,async,temporal} suites (+76 tests), using fake streams / TestModel + fake streaming/tracing — no live infrastructure required.
  • Extends the harness-integration live-matrix to all five harnesses and switches the path trigger to a test_harness_*.py glob so future suites are picked up automatically.

Test plan

  • pytest the 9 new suites — 76 passed

Notes

Stacked on #432 (needs the OpenAI facade export). Retarget to next after the chain merges.

🤖 Generated with Claude Code

Greptile Summary

This PR brings the new OpenAI, claude-code, and Codex harnesses up to full integration-test parity by adding 76 offline tests across nine new suites and updating the CI matrix to cover all five harnesses automatically.

  • 9 new test files (test_harness_{openai,claude_code,codex}_{sync,async,temporal}.py) exercise UnifiedEmitter.yield_turn and auto_send_turn with hand-built canonical event fixtures and fake streaming/tracing backends — no live infrastructure required.
  • CI workflow generalises the path trigger from two explicit glob patterns to test_harness_*.py and adds openai, claude_code, and codex to the live-matrix (3 harnesses × 3 channels = 9 new matrix jobs).
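As a rough illustration of the two points above, the sketch below enumerates the nine suite file names from the 3 × 3 harness/channel grid and checks that each one matches the broadened trigger pattern. It uses Python's `fnmatch` on bare file names as a stand-in; GitHub Actions path filters have their own glob semantics and the real workflow presumably matches full paths, so treat this as an approximation rather than the workflow's actual behaviour. The future suite name is hypothetical.

```python
from fnmatch import fnmatch
from itertools import product

# Harness and channel names as listed in the PR description.
HARNESSES = ["openai", "claude_code", "codex"]
CHANNELS = ["sync", "async", "temporal"]

# The broadened trigger pattern (file-name part only, for illustration).
TRIGGER_GLOB = "test_harness_*.py"


def suite_names(harnesses, channels):
    """Build one suite file name per (harness, channel) pair."""
    return [f"test_harness_{h}_{c}.py" for h, c in product(harnesses, channels)]


suites = suite_names(HARNESSES, CHANNELS)

# 3 harnesses x 3 channels = 9 suites, all matched by the single glob.
all_matched = all(fnmatch(name, TRIGGER_GLOB) for name in suites)

# A hypothetical future suite is picked up without editing the pattern.
future_matched = fnmatch("test_harness_newharness_sync.py", TRIGGER_GLOB)
```

The design point is that a single wildcard pattern scales with the number of harnesses, whereas explicit per-harness patterns must be edited every time a suite is added.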

Confidence Score: 5/5

All changes are additive test code and a CI workflow update; no production logic is touched.

Every new file is a pure test addition using offline fakes with no live infrastructure. The CI change is a straightforward matrix expansion plus a glob-broadening that correctly picks up any future test_harness_*.py suites. The tests follow the same patterns already established by the pydantic_ai and langgraph suites.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/harness-integration.yml | Broadens path trigger glob from harness-specific patterns to test_harness_*.py and extends the live-matrix with openai, claude_code, and codex; clean and intentional. |
| tests/lib/core/harness/test_harness_openai_async.py | Async harness suite for OpenAI; covers message ordering, content verification, usage from result path (with monkeypatch), and span derivation using a fake streaming backend. |
| tests/lib/core/harness/test_harness_openai_sync.py | Sync harness suite for OpenAI; exercises yield_turn path with canonical stream, reasoning span derivation, and Start/Done index matching. |
| tests/lib/core/harness/test_harness_claude_code_sync.py | Sync harness suite for claude-code; covers tool/text event ordering, reasoning span for thinking blocks, tracer suppression, and Start/Done index matching. |
| tests/lib/core/harness/test_harness_codex_sync.py | Sync harness suite for codex; includes a thorough test_reasoning_span_opened_then_done_closed that explicitly inspects OpenSpan/CloseSpan signals and is_complete. |
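
The "Start/Done index matching" checks mentioned for the sync suites could look roughly like the sketch below. The `Start` and `Done` classes and the `unmatched_indices` helper are hypothetical stand-ins written for this illustration; the real event types and assertions in the suites may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Start:
    """Hypothetical stand-in for a stream 'start' event."""
    index: int


@dataclass(frozen=True)
class Done:
    """Hypothetical stand-in for a stream 'done' event."""
    index: int


def unmatched_indices(events):
    """Return sorted indices that are started twice, closed without a
    prior Start, or never closed."""
    open_indices = set()
    problems = set()
    for event in events:
        if isinstance(event, Start):
            if event.index in open_indices:
                problems.add(event.index)  # started again before Done
            open_indices.add(event.index)
        elif isinstance(event, Done):
            if event.index in open_indices:
                open_indices.remove(event.index)
            else:
                problems.add(event.index)  # Done with no open Start
    # Anything still open at the end never received its Done.
    return sorted(problems | open_indices)


well_formed = [Start(0), Done(0), Start(1), Done(1)]
mismatched = [Start(0), Done(1)]
```

A test would then assert that `unmatched_indices(...)` is empty for the events emitted by a turn.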

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Harnesses["Harnesses under test"]
        OAI[OpenAI]
        CC[Claude Code]
        CDX[Codex]
    end

    subgraph Channels["Channels × 3"]
        SYNC[sync\nyield_turn]
        ASYNC[async\nauto_send_turn]
        TEMP[temporal\nauto_send_turn\n+ created_at]
    end

    subgraph Fakes["Offline fakes"]
        FS[_FakeStreaming]
        FT[FakeTracing]
        FIX[Event fixtures\nhand-built]
    end

    OAI --> SYNC
    OAI --> ASYNC
    OAI --> TEMP
    CC --> SYNC
    CC --> ASYNC
    CC --> TEMP
    CDX --> SYNC
    CDX --> ASYNC
    CDX --> TEMP

    SYNC -->|events| FIX
    ASYNC -->|events| FIX
    TEMP -->|events + created_at| FIX

    ASYNC --> FS
    TEMP --> FS
    SYNC --> FT
    ASYNC --> FT

Comments Outside Diff (3)

  1. General comment

    P1 Nine offline harness suites do not run cleanly on head and collect only 39 tests instead of the claimed 76

    • Bug
      • Running the exact nine target suite paths on head does not produce the claimed all-passing offline integration coverage. Pytest collects 39 items and reports 1 failed, 38 passed, 7 errors, rather than approximately 76 passing tests. The blocking failures are in changed tests and their imported changed harness code: tests/lib/core/harness/test_harness_openai_sync.py fails collection due to missing scale_gp; all three Claude Code suites fail collection via src/agentex/lib/adk/_modules/_claude_code_turn.py:159 asserting ClaudeCodeTurn does not satisfy HarnessTurn; and tests/lib/core/harness/test_harness_openai_async.py:251 fails importing agentex.lib.adk._modules._openai_turn via the package attribute path. This is a contract mismatch with the PR validation objective that the new suites run offline and pass.
    • Cause
      • The new test suites import broad ADK modules that require extra dependencies/import paths during collection, and ClaudeCodeTurn's runtime protocol assertion does not hold in the executed environment. The OpenAI async test also relies on importing agentex.lib.adk as an attribute of agentex.lib, which failed after module loading in the test run.
    • Fix
      • Ensure the nine offline suites are self-contained under the documented test invocation: add/move required test dependencies into the runnable test environment or avoid importing infrastructure-heavy ADK modules during offline tests; fix ClaudeCodeTurn so its events property and usage() method satisfy HarnessTurn at runtime; and adjust the OpenAI async test/import path so agentex.lib.adk._modules._openai_turn is importable consistently. Re-run the exact nine paths until they collect the expected test count and pass without live credentials/services.

    T-Rex Ran code and verified through T-Rex
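
One way to make the import path in the OpenAI async test less fragile, sketched here under assumptions, is to resolve the target module with `importlib.import_module` and patch attributes on the returned module object, instead of walking a dotted attribute path from a parent package. The example uses the standard-library module `json.decoder` purely as a stand-in for `agentex.lib.adk._modules._openai_turn`; the `resolve_module` helper is hypothetical.

```python
import importlib
from unittest import mock


def resolve_module(dotted_name):
    """Import the named module and return the leaf module object itself,
    rather than relying on it being set as an attribute of its parent."""
    return importlib.import_module(dotted_name)


# Stand-in target; in the PR this would be the harness turn module.
decoder_module = resolve_module("json.decoder")

# Patch an attribute directly on the resolved module object.
sentinel = object()
with mock.patch.object(decoder_module, "JSONDecoder", sentinel):
    seen_inside = decoder_module.JSONDecoder

# The original attribute is restored when the context manager exits.
seen_after = decoder_module.JSONDecoder
```

With pytest, the same idea would be `monkeypatch.setattr(module_object, "name", value)` on a module obtained this way.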

  2. General comment

    P1 Changed harness tests use typing.override, breaking collection on supported Python 3.11

    • Bug
      • On head, pytest --collect-only aborts while importing several changed test files because they import override from typing (from typing import ... override). typing.override was only added in Python 3.12, so under Python 3.11 the import fails and the openai sync, openai async (via imported source), claude_code sync, and codex sync suites collect 0 tests. This prevents the nine-suite parity claim from holding in the repository's declared Python 3.11 environment.
    • Cause
      • The new tests and source modules import override from typing, where it is available only on Python 3.12 and later; this repository declares Python 3.11 support and the validation environment is Python 3.11.6.
    • Fix
      • Import override from typing_extensions instead of typing in the changed tests and any imported source modules that must run under Python 3.11.

    T-Rex Ran code and verified through T-Rex
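
A minimal sketch of a version-tolerant import for this fix follows. The review recommends importing `override` from `typing_extensions`; the extra fallbacks here (trying `typing` first, and a no-op decorator if neither is available) are assumptions added so the snippet runs in any environment. `BaseStream` and `FakeStream` are hypothetical classes for illustration only.

```python
try:
    # Available in the standard library from Python 3.12 onward.
    from typing import override
except ImportError:
    try:
        # Backport that also works on Python 3.11.
        from typing_extensions import override
    except ImportError:
        def override(method):
            """Last-resort no-op so the module still imports."""
            return method


class BaseStream:
    """Hypothetical base class standing in for a harness stream."""

    def events(self):
        return []


class FakeStream(BaseStream):
    """Hypothetical offline fake that overrides the base method."""

    @override
    def events(self):
        return ["start", "done"]


collected = FakeStream().events()
```

In a repository that already depends on `typing_extensions`, the plain `from typing_extensions import override` form suggested in the review is simpler and behaves the same on 3.11 and 3.12.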

  3. General comment

    P1 Claude Code harness import-time protocol assertion prevents async and temporal suite collection

    • Bug
      • After Python import setup reaches the Claude Code suites, both test_harness_claude_code_async.py and test_harness_claude_code_temporal.py fail during collection because importing agentex.lib.adk._modules._claude_code_turn raises AssertionError: ClaudeCodeTurn must satisfy the HarnessTurn protocol. As a result, both changed pairings collect 0 tests.
    • Cause
      • An import-time runtime protocol assertion in _claude_code_turn.py fails before pytest can collect the new claude_code async/temporal tests.
    • Fix
      • Update ClaudeCodeTurn to satisfy the HarnessTurn runtime protocol or remove/defer the import-time assertion so test modules can be imported and collected.

    T-Rex Ran code and verified through T-Rex
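
The two options in the fix above can be sketched as follows. `HarnessTurnSketch` is a hypothetical protocol based only on the members named in the review (an `events` property and a `usage()` method); the real `HarnessTurn` may declare more. The check is wrapped in a function so it runs when called, for example inside a test, instead of at import time.

```python
from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class HarnessTurnSketch(Protocol):
    """Hypothetical stand-in for the HarnessTurn protocol."""

    @property
    def events(self) -> Iterable[Any]: ...

    def usage(self) -> dict: ...


class CompleteTurn:
    """Provides both members, so it satisfies the protocol."""

    def __init__(self, events):
        self._events = list(events)

    @property
    def events(self):
        return self._events

    def usage(self):
        return {"input_tokens": 0, "output_tokens": 0}


class IncompleteTurn:
    """Missing usage(), so the runtime check fails."""

    @property
    def events(self):
        return []


def ensure_harness_turn(obj):
    """Deferred check: raises only when called, not when the module loads."""
    if not isinstance(obj, HarnessTurnSketch):
        raise TypeError(f"{type(obj).__name__} does not satisfy HarnessTurnSketch")
    return obj


ok_turn = ensure_harness_turn(CompleteTurn(["start", "done"]))
```

Note that `isinstance` against a runtime-checkable protocol only verifies that the members exist, not their signatures, so a static type check remains the stronger guarantee.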

Reviews (11): Last reviewed commit: "fix(test): update codex reasoning span e..."

@declan-scale declan-scale force-pushed the declan-scale/harness-openai-modules branch from 09c107a to 0175fb6 Compare June 23, 2026 15:27
@declan-scale declan-scale force-pushed the declan-scale/harness-integration-tests branch from 76361ef to 1c5dd14 Compare June 23, 2026 15:28
@declan-scale

Contributor Author

@greptile review

@declan-scale declan-scale force-pushed the declan-scale/harness-openai-modules branch from 0175fb6 to 97a34fa Compare June 23, 2026 15:43
@declan-scale declan-scale force-pushed the declan-scale/harness-integration-tests branch from 1c5dd14 to 86ce557 Compare June 23, 2026 15:44
@declan-scale

Contributor Author

@greptile review

@declan-scale declan-scale force-pushed the declan-scale/harness-openai-modules branch from 97a34fa to b01cfcc Compare June 23, 2026 16:51
@declan-scale declan-scale force-pushed the declan-scale/harness-integration-tests branch from 86ce557 to 5f81fc4 Compare June 23, 2026 16:51
@declan-scale

Contributor Author

@greptile review

@declan-scale declan-scale force-pushed the declan-scale/harness-openai-modules branch from b01cfcc to 24d10ab Compare June 23, 2026 19:53
@declan-scale declan-scale force-pushed the declan-scale/harness-integration-tests branch from 5f81fc4 to 7f0d754 Compare June 23, 2026 19:53
@declan-scale declan-scale force-pushed the declan-scale/harness-openai-modules branch from 24d10ab to c17c9b3 Compare June 23, 2026 19:56
@declan-scale declan-scale force-pushed the declan-scale/harness-integration-tests branch from 7f0d754 to b158aa2 Compare June 23, 2026 19:56
@declan-scale declan-scale force-pushed the declan-scale/harness-openai-modules branch from c17c9b3 to ee560a1 Compare June 23, 2026 22:04
@declan-scale declan-scale force-pushed the declan-scale/harness-integration-tests branch from b158aa2 to 561884d Compare June 23, 2026 22:04
@declan-scale declan-scale force-pushed the declan-scale/harness-openai-modules branch from ee560a1 to fdf6187 Compare June 23, 2026 22:29
@declan-scale declan-scale force-pushed the declan-scale/harness-integration-tests branch from 561884d to 96141e0 Compare June 23, 2026 22:29
Comment thread tests/lib/core/harness/test_harness_codex_sync.py Outdated
@declan-scale declan-scale force-pushed the declan-scale/harness-openai-modules branch from 4ea74ac to 1efd8dc Compare June 23, 2026 23:40
Base automatically changed from declan-scale/harness-openai-modules to next June 23, 2026 23:46
@declan-scale declan-scale force-pushed the declan-scale/harness-integration-tests branch from 96141e0 to b30a90b Compare June 23, 2026 23:47
Add offline sync/async/temporal integration suites for the openai, claude_code
and codex harnesses (+76 tests), mirroring the existing langgraph/pydantic_ai
coverage. Extend the harness-integration live-matrix to all five harnesses and
switch the path trigger to a test_harness_*.py glob so new suites are picked up
automatically.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@declan-scale declan-scale reopened this Jun 23, 2026
@declan-scale declan-scale merged commit ce438e4 into next Jun 24, 2026
64 checks passed
@declan-scale declan-scale deleted the declan-scale/harness-integration-tests branch June 24, 2026 00:36