Fix draft-run SSE accepted timing#740
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fe04faa564
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (onAcceptedAsync != null) | ||
| await onAcceptedAsync(receipt, ct); | ||
|
|
||
| await _dispatchPipeline.DispatchPreparedAsync(execution, ct); |
There was a problem hiding this comment.
Emit accepted only after dispatch admission
onAcceptedAsync now runs before DispatchPreparedAsync, so SSE/WebSocket clients can receive a run-started/accepted ack even when dispatch later fails (or is canceled) and the command never reaches the actor inbox. In HandleChat, this also starts the stream early, so failures after that point can no longer be returned as normal start errors and become stream-error frames instead, leaving clients with a false positive “accepted” state.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Thanks for calling this out. For this endpoint the early frame is intentional: after PrepareAsync succeeds we have a stable command/run id and can start the SSE stream before the potentially long actor/LLM dispatch path. Moving onAcceptedAsync back after DispatchPreparedAsync would reintroduce the zero-frame/late-accepted behavior this PR is fixing. If DispatchPreparedAsync fails after that point, the honest outcome should be a stream error/terminal frame rather than a pre-stream start error. I also rechecked the follow-up projector change so committed completions no longer replay content when ContentEmitted=true; they only synthesize the missing terminal frames.
Codecov Report❌ Patch coverage is @@ Coverage Diff @@
## dev #740 +/- ##
==========================================
+ Coverage 82.48% 82.52% +0.03%
==========================================
Files 941 941
Lines 60101 60270 +169
Branches 7872 7890 +18
==========================================
+ Hits 49575 49737 +162
- Misses 7131 7135 +4
- Partials 3395 3398 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1 file with indirect coverage changes 🚀 New features to boost your workflow:
|
|
Review/runtime follow-up:
Validation run:
|
|
补充正常链路实测结果:这次不是只测异常链路,已在本地 distribute Host 下把 LLM provider 指到本地 OpenAI-compatible mock server,并重新调用 draft-run SSE。\n\n请求:\n |
|
补充问题 1 的本地复现与修复结果:\n\n复现方式:走 Studio member API,创建 member 后用 issue 中这组 GAgent binding 参数调用:\n |
Respect the committed completion content-emitted flag so recovery only synthesizes terminal frames after live content has already been streamed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Follow-up pushed in What changed:
Validation:
|
| if (onAcceptedAsync != null) | ||
| await onAcceptedAsync(receipt, ct); | ||
|
|
||
| await _dispatchPipeline.DispatchPreparedAsync(execution, ct); |
There was a problem hiding this comment.
This moves accepted/runStarted before dispatch. For GAgent draft-run, onAcceptedAsync starts the SSE response and sets ResponseStarted to true. If DispatchPreparedAsync or the actor handler throws afterward, the endpoint exception path skips prepared actor rollback because ResponseStarted == true.
This affects draft-run requests that create a new actor. GAgentDraftRunActorPreparationService marks newly created actors with RequiresRollbackOnFailure: true, but after early runStarted, the failure path may no longer unregister/destroy that temporary actor, leaving the actor and registry entry behind. Reusing an existing actor is not affected.
There was a problem hiding this comment.
Good catch — this was a real issue. I pushed dbafc05 to address it.
What changed:
- draft-run exception/timeout/client-disconnect paths now call RollbackPreparedActorAsync even after the SSE response has started
- rollback remains gated by RequiresRollbackOnFailure, so existing actor reuse is not affected
- added coverage for the case where runStarted is emitted first and a later failure still rolls back the prepared temporary actor
Validation:
- dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --no-restore --filter "FullyQualifiedName~ScopeServiceEndpointsStreamTests" — passed 19/19
- git diff --check — passed
Ensure temporary draft-run actors are cleaned up even when an accepted SSE frame has already been sent and a later dispatch or execution failure occurs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add focused stream projector tests for ignored envelopes, live terminal frame synthesis, runFinished id completion, and committed completion failure/empty-content paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…kend-issues' into fix/2026-05-20_gagent-member-backend-issues
Summary
Issue #736 scope
This PR fixes issue #736 problem 2. I also tried to reproduce problem 1 locally in distributed mode, including the PR #702 GAgent member payload, but the member binding path returned 202 and the binding run succeeded; missing members were asynchronously rejected with STUDIO_MEMBER_NOT_FOUND. No 500 was reproduced locally, so this PR intentionally does not claim to fix problem 1.
Verification
StudioMemberEndpointsTests|FullyQualifiedNameScopeBindingStudioMemberPlatformBindingCommandServiceTests"ScopeGAgentAguiEventMapperTests|FullyQualifiedNameGAgentDraftRunInteractionCoverageTests"ScopeServiceEndpointsStreamTests|FullyQualifiedNameScopeGAgentEndpointsTests"Local runtime check
Started Mainnet Host in distributed mode against local Kafka/Garnet/Elasticsearch with Orleans/Kafka/Garnet runtime configuration and called POST /api/scopes/{scopeId}/gagent/draft-run. Neo4j still reported an auth failure in Development and was ignored by startup, matching the earlier local setup.
The SSE response now produces a real downstream terminal event, not just an initial runStarted frame and not a generic timeout. With no local NyxID auth token, the observed frames were: