test(harness): integration-test parity for openai, claude-code, codex by declan-scale · Pull Request #433 · scaleapi/scale-agentex-python

declan-scale · 2026-06-23T14:02:46Z

Summary

Seventh slice of #425. Brings the new harnesses up to integration-test parity.

Adds offline test_harness_{openai,claude_code,codex}_{sync,async,temporal} suites (+76 tests), using fake streams / TestModel + fake streaming/tracing — no live infrastructure required.
Extends the harness-integration live-matrix to all five harnesses and switches the path trigger to a test_harness_*.py glob so future suites are picked up automatically.

Test plan

pytest the 9 new suites — 76 passed

Notes

Stacked on #432 (needs the OpenAI facade export). Retarget to next after the chain merges.

🤖 Generated with Claude Code

Greptile Summary

This PR brings the new OpenAI, claude-code, and Codex harnesses up to full integration-test parity by adding 76 offline tests across nine new suites and updating the CI matrix to cover all five harnesses automatically.

9 new test files (test_harness_{openai,claude_code,codex}_{sync,async,temporal}.py) exercise UnifiedEmitter.yield_turn and auto_send_turn with hand-built canonical event fixtures and fake streaming/tracing backends — no live infrastructure required.
CI workflow generalises the path trigger from two explicit glob patterns to test_harness_*.py and adds openai, claude_code, and codex to the live-matrix (3 harnesses × 3 channels = 9 new matrix jobs).

Confidence Score: 5/5

All changes are additive test code and a CI workflow update; no production logic is touched.

Every new file is a pure test addition using offline fakes with no live infrastructure. The CI change is a straightforward matrix expansion plus a glob-broadening that correctly picks up any future test_harness_*.py suites. The tests follow the same patterns already established by the pydantic_ai and langgraph suites.

No files require special attention.

Important Files Changed

Filename	Overview
.github/workflows/harness-integration.yml	Broadens path trigger glob from harness-specific patterns to `test_harness_*.py` and extends the live-matrix with `openai`, `claude_code`, and `codex`; clean and intentional.
tests/lib/core/harness/test_harness_openai_async.py	Async harness suite for OpenAI; covers message ordering, content verification, usage from result path (with monkeypatch), and span derivation using a fake streaming backend.
tests/lib/core/harness/test_harness_openai_sync.py	Sync harness suite for OpenAI; exercises yield_turn path with canonical stream, reasoning span derivation, and Start/Done index matching.
tests/lib/core/harness/test_harness_claude_code_sync.py	Sync harness suite for claude-code; covers tool/text event ordering, reasoning span for thinking blocks, tracer suppression, and Start/Done index matching.
tests/lib/core/harness/test_harness_codex_sync.py	Sync harness suite for codex; includes a thorough `test_reasoning_span_opened_then_done_closed` that explicitly inspects OpenSpan/CloseSpan signals and `is_complete`.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Harnesses["Harnesses under test"]
        OAI[OpenAI]
        CC[Claude Code]
        CDX[Codex]
    end

    subgraph Channels["Channels × 3"]
        SYNC[sync\nyield_turn]
        ASYNC[async\nauto_send_turn]
        TEMP[temporal\nauto_send_turn\n+ created_at]
    end

    subgraph Fakes["Offline fakes"]
        FS[_FakeStreaming]
        FT[FakeTracing]
        FIX[Event fixtures\nhand-built]
    end

    OAI --> SYNC
    OAI --> ASYNC
    OAI --> TEMP
    CC --> SYNC
    CC --> ASYNC
    CC --> TEMP
    CDX --> SYNC
    CDX --> ASYNC
    CDX --> TEMP

    SYNC -->|events| FIX
    ASYNC -->|events| FIX
    TEMP -->|events + created_at| FIX

    ASYNC --> FS
    TEMP --> FS
    SYNC --> FT
    ASYNC --> FT

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    subgraph Harnesses["Harnesses under test"]
        OAI[OpenAI]
        CC[Claude Code]
        CDX[Codex]
    end

    subgraph Channels["Channels × 3"]
        SYNC[sync\nyield_turn]
        ASYNC[async\nauto_send_turn]
        TEMP[temporal\nauto_send_turn\n+ created_at]
    end

    subgraph Fakes["Offline fakes"]
        FS[_FakeStreaming]
        FT[FakeTracing]
        FIX[Event fixtures\nhand-built]
    end

    OAI --> SYNC
    OAI --> ASYNC
    OAI --> TEMP
    CC --> SYNC
    CC --> ASYNC
    CC --> TEMP
    CDX --> SYNC
    CDX --> ASYNC
    CDX --> TEMP

    SYNC -->|events| FIX
    ASYNC -->|events| FIX
    TEMP -->|events + created_at| FIX

    ASYNC --> FS
    TEMP --> FS
    SYNC --> FT
    ASYNC --> FT

Comments Outside Diff (3)

General comment

Nine offline harness suites do not run cleanly on head and collect only 39 tests instead of the claimed 76
- Bug
  - Running the exact nine target suite paths on head does not produce the claimed all-passing offline integration coverage. Pytest collects 39 items and reports 1 failed, 38 passed, 7 errors, rather than approximately 76 passing tests. The blocking failures are in changed tests and their imported changed harness code: tests/lib/core/harness/test_harness_openai_sync.py fails collection due to missing scale_gp; all three Claude Code suites fail collection via src/agentex/lib/adk/_modules/_claude_code_turn.py:159 asserting ClaudeCodeTurn does not satisfy HarnessTurn; and tests/lib/core/harness/test_harness_openai_async.py:251 fails importing agentex.lib.adk._modules._openai_turn via the package attribute path. This is a contract mismatch with the PR validation objective that the new suites run offline and pass.
- Cause
  - The new test suites import broad ADK modules that require extra dependencies/import paths during collection, and ClaudeCodeTurn's runtime protocol assertion does not hold in the executed environment. The OpenAI async test also relies on importing agentex.lib.adk as an attribute of agentex.lib, which failed after module loading in the test run.
- Fix
  - Ensure the nine offline suites are self-contained under the documented test invocation: add/move required test dependencies into the runnable test environment or avoid importing infrastructure-heavy ADK modules during offline tests; fix ClaudeCodeTurn so its events property and usage() method satisfy HarnessTurn at runtime; and adjust the OpenAI async test/import path so agentex.lib.adk._modules._openai_turn is importable consistently. Re-run the exact nine paths until they collect the expected test count and pass without live credentials/services.
_{Ran code and verified through T-Rex}
General comment

Changed harness tests use typing.override, breaking collection on supported Python 3.11
- Bug
  - On head, pytest --collect-only aborts while importing several changed test files because they do from typing import ... override. Python 3.11 does not expose typing.override, so openai sync, openai async via imported source, claude_code sync, and codex sync collect 0 tests. This prevents the nine-suite parity claim from being true in the repository's declared Python 3.11 environment.
- Cause
  - The new tests/source import override from typing, which is only available in newer Python versions; this repository declares Python 3.11 support and the validation environment is Python 3.11.6.
- Fix
  - Import override from typing_extensions instead of typing in the changed tests and any imported source modules that must run under Python 3.11.
_{Ran code and verified through T-Rex}
General comment

Claude Code harness import-time protocol assertion prevents async and temporal suite collection
- Bug
  - After Python import setup reaches the Claude Code suites, both test_harness_claude_code_async.py and test_harness_claude_code_temporal.py fail during collection because importing agentex.lib.adk._modules._claude_code_turn raises AssertionError: ClaudeCodeTurn must satisfy the HarnessTurn protocol. As a result, both changed pairings collect 0 tests.
- Cause
  - An import-time runtime protocol assertion in _claude_code_turn.py fails before pytest can collect the new claude_code async/temporal tests.
- Fix
  - Update ClaudeCodeTurn to satisfy the HarnessTurn runtime protocol or remove/defer the import-time assertion so test modules can be imported and collected.
_{Ran code and verified through T-Rex}

_{Reviews (11): Last reviewed commit: "fix(test): update codex reasoning span e..." | Re-trigger Greptile}

declan-scale · 2026-06-23T15:34:40Z

@greptile review

declan-scale · 2026-06-23T16:29:47Z

@greptile review

declan-scale · 2026-06-23T16:52:31Z

@greptile review

Add offline sync/async/temporal integration suites for the openai, claude_code and codex harnesses (+76 tests), mirroring the existing langgraph/pydantic_ai coverage. Extend the harness-integration live-matrix to all five harnesses and switch the path trigger to a test_harness_*.py glob so new suites are picked up automatically. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

declan-scale mentioned this pull request Jun 23, 2026

feat(cli): add default-openai-agents init template (async base) #434

Merged

1 task

declan-scale force-pushed the declan-scale/harness-openai-modules branch from 09c107a to 0175fb6 Compare June 23, 2026 15:27

declan-scale force-pushed the declan-scale/harness-integration-tests branch from 76361ef to 1c5dd14 Compare June 23, 2026 15:28

declan-scale force-pushed the declan-scale/harness-openai-modules branch from 0175fb6 to 97a34fa Compare June 23, 2026 15:43

declan-scale force-pushed the declan-scale/harness-integration-tests branch from 1c5dd14 to 86ce557 Compare June 23, 2026 15:44

declan-scale force-pushed the declan-scale/harness-openai-modules branch from 97a34fa to b01cfcc Compare June 23, 2026 16:51

declan-scale force-pushed the declan-scale/harness-integration-tests branch from 86ce557 to 5f81fc4 Compare June 23, 2026 16:51

declan-scale force-pushed the declan-scale/harness-openai-modules branch from b01cfcc to 24d10ab Compare June 23, 2026 19:53

declan-scale force-pushed the declan-scale/harness-integration-tests branch from 5f81fc4 to 7f0d754 Compare June 23, 2026 19:53

declan-scale force-pushed the declan-scale/harness-openai-modules branch from 24d10ab to c17c9b3 Compare June 23, 2026 19:56

declan-scale force-pushed the declan-scale/harness-integration-tests branch from 7f0d754 to b158aa2 Compare June 23, 2026 19:56

danielmillerp approved these changes Jun 23, 2026

View reviewed changes

declan-scale force-pushed the declan-scale/harness-openai-modules branch from c17c9b3 to ee560a1 Compare June 23, 2026 22:04

declan-scale force-pushed the declan-scale/harness-integration-tests branch from b158aa2 to 561884d Compare June 23, 2026 22:04

declan-scale force-pushed the declan-scale/harness-openai-modules branch from ee560a1 to fdf6187 Compare June 23, 2026 22:29

declan-scale force-pushed the declan-scale/harness-integration-tests branch from 561884d to 96141e0 Compare June 23, 2026 22:29

greptile-apps Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread tests/lib/core/harness/test_harness_codex_sync.py Outdated

declan-scale force-pushed the declan-scale/harness-openai-modules branch from 4ea74ac to 1efd8dc Compare June 23, 2026 23:40

Base automatically changed from declan-scale/harness-openai-modules to next June 23, 2026 23:46

declan-scale closed this Jun 23, 2026

declan-scale force-pushed the declan-scale/harness-integration-tests branch from 96141e0 to b30a90b Compare June 23, 2026 23:47

declan-scale reopened this Jun 23, 2026

fix(test): update codex reasoning span expectation

8a0cf6b

declan-scale merged commit ce438e4 into next Jun 24, 2026
64 checks passed

declan-scale deleted the declan-scale/harness-integration-tests branch June 24, 2026 00:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(harness): integration-test parity for openai, claude-code, codex#433

test(harness): integration-test parity for openai, claude-code, codex#433
declan-scale merged 2 commits into
nextfrom
declan-scale/harness-integration-tests

declan-scale commented Jun 23, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

declan-scale commented Jun 23, 2026

Uh oh!

declan-scale commented Jun 23, 2026

Uh oh!

declan-scale commented Jun 23, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

declan-scale commented Jun 23, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Notes

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Comments Outside Diff (3)

Uh oh!

declan-scale commented Jun 23, 2026

Uh oh!

declan-scale commented Jun 23, 2026

Uh oh!

declan-scale commented Jun 23, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

declan-scale commented Jun 23, 2026 •

edited by greptile-apps Bot

Loading