Fix draft-run SSE accepted timing by louis4li · Pull Request #740 · aevatarAI/aevatar

louis4li · 2026-05-20T07:21:35Z

Summary

emit the draft-run accepted/runStarted frame after command preparation and before awaiting prepared dispatch
keep stream pumping after dispatch admission so long-running actor/LLM execution no longer leaves SSE clients with an open zero-frame response
map committed RoleChatSessionCompletedEvent terminal facts through the draft-run projection session so SSE receives real downstream terminal frames
return terminal provider/LLM failures as AG-UI runError frames instead of falling back to a generic endpoint timeout
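The ordering described in the summary can be modelled roughly as follows. This is a minimal Python sketch, not the C# implementation in the PR. `draft_run_frames`, `collect` and the dict-shaped frames are hypothetical stand-ins, and the real endpoint pumps the draft-run projection session rather than a plain queue.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

TERMINAL_KEYS = ("runFinished", "runError")


async def draft_run_frames(
    prepare: Callable[[], Awaitable[dict]],
    dispatch: Callable[[dict], Awaitable[None]],
    events: "asyncio.Queue[dict]",
) -> AsyncIterator[dict]:
    """Yield AG-UI-style frames for one draft run (hypothetical model)."""
    # 1. Preparation yields a stable thread/run id.
    receipt = await prepare()
    # 2. The accepted/runStarted frame goes out before dispatch is awaited.
    yield {"runStarted": {"threadId": receipt["threadId"], "runId": receipt["runId"]}}
    try:
        # 3. Dispatch to the actor; the actor/LLM path may take a long time.
        await dispatch(receipt)
    except Exception as exc:
        # The stream has already started, so the failure is reported in-band.
        yield {"runError": {"message": str(exc)}}
        return
    # 4. Keep pumping downstream frames until a terminal frame arrives.
    while True:
        frame = await events.get()
        yield frame
        if any(key in frame for key in TERMINAL_KEYS):
            return


async def collect(dispatch_fails: bool) -> list:
    """Drive one simulated run and gather every frame the client would see."""
    events: asyncio.Queue = asyncio.Queue()

    async def prepare() -> dict:
        return {"threadId": "Role:demo", "runId": "run-1"}

    async def dispatch(receipt: dict) -> None:
        if dispatch_fails:
            raise RuntimeError("dispatch rejected")
        # Simulate the actor/LLM path ending in a provider failure.
        await events.put({"runError": {"message": "provider failure"}})

    return [frame async for frame in draft_run_frames(prepare, dispatch, events)]


if __name__ == "__main__":
    print(asyncio.run(collect(dispatch_fails=False)))
    print(asyncio.run(collect(dispatch_fails=True)))
```

Because runStarted is yielded before `await dispatch(...)`, a dispatch failure can only surface as an in-band runError frame, which is the trade-off raised in the Codex review.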

Issue #736 scope

This PR fixes issue #736 problem 2. I also tried to reproduce problem 1 locally in distributed mode, including the PR #702 GAgent member payload, but the member binding path returned 202 and the binding run succeeded; missing members were asynchronously rejected with STUDIO_MEMBER_NOT_FOUND. No 500 was reproduced locally, so this PR intentionally does not claim to fix problem 1.

Verification

dotnet test test/Aevatar.Workflow.Application.Tests/Aevatar.Workflow.Application.Tests.csproj --nologo --filter "FullyQualifiedName~WorkflowApplicationLayerTests"
dotnet test test/Aevatar.Studio.Tests/Aevatar.Studio.Tests.csproj --nologo --filter "FullyQualifiedName~StudioMemberEndpointsTests|FullyQualifiedName~ScopeBindingStudioMemberPlatformBindingCommandServiceTests"
dotnet test test/Aevatar.GAgentService.Tests/Aevatar.GAgentService.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeGAgentAguiEventMapperTests|FullyQualifiedName~GAgentDraftRunInteractionCoverageTests"
dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeServiceEndpointsStreamTests|FullyQualifiedName~ScopeGAgentEndpointsTests"
bash tools/ci/test_stability_guards.sh
bash tools/ci/query_projection_priming_guard.sh
bash tools/ci/projection_state_version_guard.sh
bash tools/ci/projection_state_mirror_current_state_guard.sh
dotnet build aevatar.slnx --nologo

Local runtime check

Started Mainnet Host in distributed mode against local Kafka/Garnet/Elasticsearch with Orleans/Kafka/Garnet runtime configuration and called POST /api/scopes/{scopeId}/gagent/draft-run. Neo4j still reported an auth failure in Development and was ignored by startup, matching the earlier local setup.

The SSE response now produces a real downstream terminal event, not just an initial runStarted frame and not a generic timeout. With no local NyxID auth token, the observed frames were:

data: { "runStarted": { "threadId": "Role:ce480c92", "runId": "27699f500f124d3c877d9db5f2ce20a4" } }

data: { "runError": { "message": "LLM request failed [tools=...]: NyxID authentication required for provider 'nyxid'. Please sign in." } }

So the local unauthenticated environment cannot prove a successful `pong` text response, but it does prove that the actor/LLM path reaches a terminal provider result and that the SSE chain returns that terminal data promptly instead of hanging until the timeout.
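To check a captured response like the one above for a terminal frame, a client can split the SSE body on blank lines and parse each `data:` payload as JSON. The Python sketch below is simplified: it assumes `\n` line endings, ignores `event:`/`id:` fields, and strips all leading whitespace after `data:`. `parse_sse_frames` and `terminal_frame` are hypothetical helpers, and the runError message is abbreviated.

```python
import json
from typing import Optional

TERMINAL_KEYS = ("runFinished", "runError")


def parse_sse_frames(body: str) -> list:
    """Parse each blank-line-separated SSE event's data lines as one JSON frame."""
    frames = []
    for event in body.split("\n\n"):
        data = [line[len("data:"):].lstrip() for line in event.splitlines() if line.startswith("data:")]
        if data:
            # Multiple data lines in one event are joined with newlines.
            frames.append(json.loads("\n".join(data)))
    return frames


def terminal_frame(frames: list) -> Optional[dict]:
    """Return the first runFinished/runError frame, or None if the run never ended."""
    for frame in frames:
        if any(key in frame for key in TERMINAL_KEYS):
            return frame
    return None


body = (
    'data: { "runStarted": { "threadId": "Role:ce480c92", "runId": "27699f500f124d3c877d9db5f2ce20a4" } }\n\n'
    'data: { "runError": { "message": "NyxID authentication required for provider \'nyxid\'." } }\n\n'
)
frames = parse_sse_frames(body)
print(terminal_frame(frames))
```

A response that stops after runStarted makes `terminal_frame` return None, which is the zero-terminal-frame symptom this PR addresses.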

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe04faa564


chatgpt-codex-connector · 2026-05-20T07:24:48Z

            if (onAcceptedAsync != null)
                await onAcceptedAsync(receipt, ct);

+            await _dispatchPipeline.DispatchPreparedAsync(execution, ct);


Emit accepted only after dispatch admission

onAcceptedAsync now runs before DispatchPreparedAsync, so SSE/WebSocket clients can receive a run-started/accepted ack even when dispatch later fails (or is canceled) and the command never reaches the actor inbox. In HandleChat, this also starts the stream early, so failures after that point can no longer be returned as normal start errors and become stream-error frames instead, leaving clients with a false positive “accepted” state.


Thanks for calling this out. For this endpoint the early frame is intentional: after PrepareAsync succeeds we have a stable command/run id and can start the SSE stream before the potentially long actor/LLM dispatch path. Moving onAcceptedAsync back after DispatchPreparedAsync would reintroduce the zero-frame/late-accepted behavior this PR is fixing. If DispatchPreparedAsync fails after that point, the honest outcome should be a stream error/terminal frame rather than a pre-stream start error. I also rechecked the follow-up projector change so committed completions no longer replay content when ContentEmitted=true; they only synthesize the missing terminal frames.
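The trade-off in this thread reduces to one decision: once the SSE response has started, a failure can no longer become an HTTP error status and has to be reported as a terminal frame. A hypothetical Python sketch; `surface_failure` and `HttpError` are illustrative names, and the 500 status is an assumption.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class HttpError:
    status: int
    message: str


def surface_failure(response_started: bool, message: str) -> Union[HttpError, dict]:
    """Choose how a draft-run failure reaches the client."""
    if not response_started:
        # Nothing is on the wire yet, so a normal error response is still possible.
        return HttpError(status=500, message=message)
    # Headers and runStarted were already sent: report the failure in-band.
    return {"runError": {"message": message}}
```

Emitting the accepted frame earlier means more failures take the second branch, which is the behaviour the reply argues is the honest outcome.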

codecov · 2026-05-20T07:36:41Z

Codecov Report

❌ Patch coverage is 92.99065% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.52%. Comparing base (ab91b35) to head (3efd844).
⚠️ Report is 17 commits behind head on dev.

Files with missing lines	Patch %	Lines
src/Aevatar.AI.Core/RoleGAgent.cs	58.82%	6 Missing and 1 partial ⚠️
...actions/ScopeGAgents/ScopeGAgentAguiEventMapper.cs	88.63%	2 Missing and 3 partials ⚠️
.../Projectors/GAgentDraftRunSessionEventProjector.cs	98.58%	0 Missing and 2 partials ⚠️
...tService.Hosting/Endpoints/ScopeGAgentEndpoints.cs	66.66%	1 Missing ⚠️

@@            Coverage Diff             @@
##              dev     #740      +/-   ##
==========================================
+ Coverage   82.48%   82.52%   +0.03%     
==========================================
  Files         941      941              
  Lines       60101    60270     +169     
  Branches     7872     7890      +18     
==========================================
+ Hits        49575    49737     +162     
- Misses       7131     7135       +4     
- Partials     3395     3398       +3
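The headline percentages can be reproduced from the diff above. This Python sketch assumes Codecov's default display settings (precision 2, rounding down) and computes coverage as hits / (hits + misses + partials). The 199-of-214 patch split is inferred from "15 lines missing" and 92.99065%; it is not stated in the report.

```python
from decimal import Decimal, ROUND_DOWN

TWO_PLACES = Decimal("0.01")


def pct(hits: int, misses: int, partials: int) -> Decimal:
    """Unrounded coverage percentage; partial lines count as not covered."""
    return Decimal(hits) * 100 / Decimal(hits + misses + partials)


def shown(value: Decimal) -> Decimal:
    """Truncate to two decimals (assumed Codecov default: precision 2, round down)."""
    return value.quantize(TWO_PLACES, rounding=ROUND_DOWN)


base = pct(49575, 7131, 3395)   # dev: 60101 lines
head = pct(49737, 7135, 3398)   # PR head: 60270 lines
patch = pct(199, 15, 0)         # inferred: 214 patch lines, 15 not fully covered

print(shown(base), shown(head), shown(head - base), shown(patch))
```

Truncating rather than rounding explains why the base reads 82.48% (82.486...) and the delta reads +0.03% (0.0375...).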

Flag	Coverage Δ
ci	`82.52% <92.99%> (+0.03%)`	⬆️


Files with missing lines	Coverage Δ
...e/Interactions/DefaultCommandInteractionService.cs	`80.39% <100.00%> (+0.19%)`	⬆️
...ervices/ActorDispatchStudioMemberCommandService.cs	`92.65% <100.00%> (+0.40%)`	⬆️
...tService.Hosting/Endpoints/ScopeGAgentEndpoints.cs	`83.26% <66.66%> (+0.03%)`	⬆️
.../Projectors/GAgentDraftRunSessionEventProjector.cs	`98.14% <98.58%> (+21.95%)`	⬆️
...actions/ScopeGAgents/ScopeGAgentAguiEventMapper.cs	`87.09% <88.63%> (-2.12%)`	⬇️
src/Aevatar.AI.Core/RoleGAgent.cs	`79.05% <58.82%> (-0.44%)`	⬇️

... and 1 file with indirect coverage changes


louis4li · 2026-05-20T08:02:38Z

Review/runtime follow-up:

Re-reviewed the projector/mapper split after the terminal completion change. The mapper remains a single-envelope -> single AG-UI frame translation; expansion of the committed RoleChatSessionCompletedEvent lives in GAgentDraftRunSessionEventProjector, because a successful completion can produce TextMessageStart + TextMessageContent + TextMessageEnd + RunFinished.
Verified locally in distributed mode on http://127.0.0.1:5100 with Kafka/Garnet/Elasticsearch, and Neo4j configured as neo4j/Password. Neo4j auth still fails in Development and is ignored; the document projection startup probe passes.
Runtime SSE check for /api/scopes/scope-issue736-chain/gagent/draft-run returned runStarted followed by the real downstream runError: NyxID authentication required for provider nyxid. This confirms the issue-2 chain now returns actual terminal data through the projector/session stream instead of the previous generic timeout/500 path.
Successful terminal content preservation is covered by ScopeServiceEndpointsStreamTests, including ContentEmitted=true producing TextMessageContent delta=pong before RunFinished.
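The mapper/projector split can be sketched as two functions with different cardinality. This is a hypothetical Python model: the envelope and completion dict shapes are invented for illustration, `messageId` reuses the runId as in the observed frames, and the ContentEmitted flag is ignored here.

```python
def map_envelope(envelope: dict) -> list:
    """Mapper: one live envelope -> at most one AG-UI frame."""
    if envelope.get("kind") == "textDelta":
        return [{"textMessageContent": {"messageId": envelope["runId"], "delta": envelope["delta"]}}]
    return []  # envelopes the mapper does not translate are ignored


def project_completion(completion: dict, thread_id: str) -> list:
    """Projector: one committed completion fact -> several AG-UI frames."""
    run_id = completion["runId"]
    if not completion["succeeded"]:
        return [{"runError": {"message": completion["error"]}}]
    return [
        {"textMessageStart": {"messageId": run_id, "role": "assistant"}},
        {"textMessageContent": {"messageId": run_id, "delta": completion["content"]}},
        {"textMessageEnd": {"messageId": run_id}},
        {"runFinished": {"threadId": thread_id, "runId": run_id}},
    ]
```

Keeping fan-out in the projector lets the mapper stay a pure one-to-one translation that is easy to test in isolation.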

Validation run:

dotnet test test/Aevatar.GAgentService.Tests/Aevatar.GAgentService.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeGAgentAguiEventMapperTests"
dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeServiceEndpointsStreamTests"
bash tools/ci/test_stability_guards.sh
bash tools/ci/query_projection_priming_guard.sh
bash tools/ci/projection_state_version_guard.sh
bash tools/ci/projection_state_mirror_current_state_guard.sh
git diff --check

louis4li · 2026-05-20T08:09:56Z

Additional results from testing the normal path: this time I did not test only the failure path. With the local distributed Host, I pointed the LLM provider at a local OpenAI-compatible mock server and called the draft-run SSE endpoint again.

Request:

curl -sS -N --max-time 35 -X POST \
  http://127.0.0.1:5100/api/scopes/scope-issue736-success-rerun/gagent/draft-run \
  -H 'Content-Type: application/json' \
  -H 'Accept: text/event-stream' \
  -d '{"actorTypeName":"Aevatar.AI.Core.RoleGAgent, Aevatar.AI.Core","prompt":"Please reply with exactly: pong","timeoutMs":10000}'

The actual response contains the normal content frames:

data: { "runStarted": { "threadId": "Role:96eaf780", "runId": "55d3a9e55a624e9cb12482030c1a95f4" } }

data: { "textMessageStart": { "messageId": "55d3a9e55a624e9cb12482030c1a95f4", "role": "assistant" } }

data: { "textMessageContent": { "messageId": "55d3a9e55a624e9cb12482030c1a95f4", "delta": "pong" } }

data: { "textMessageEnd": { "messageId": "55d3a9e55a624e9cb12482030c1a95f4" } }

data: { "runFinished": { "threadId": "Role:96eaf780", "runId": "55d3a9e55a624e9cb12482030c1a95f4" } }

Conclusion: on the normal path, the projector expands the committed completion into start/content/end/finished frames, and the content frame does carry data (delta=pong), so the success path is reachable too, not only the failure path.
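A test client could assert the shape of a successful stream like the one above. This Python sketch is hypothetical: `check_success_stream` is not part of the repository, and requiring every `messageId` to equal the `runId` is an assumption based only on this one observed response.

```python
EXPECTED_ORDER = ["runStarted", "textMessageStart", "textMessageContent", "textMessageEnd", "runFinished"]


def check_success_stream(frames: list) -> str:
    """Validate frame order and ids of a successful run and return the streamed text."""
    kinds = [next(iter(frame)) for frame in frames]
    # Collapse consecutive duplicates so several content deltas still match.
    collapsed = [k for i, k in enumerate(kinds) if i == 0 or k != kinds[i - 1]]
    if collapsed != EXPECTED_ORDER:
        raise ValueError(f"unexpected frame order: {kinds}")
    run_id = frames[0]["runStarted"]["runId"]
    for kind, frame in zip(kinds, frames):
        payload = frame[kind]
        # Assumption from the observed output: messageId == runId.
        if payload.get("messageId", run_id) != run_id or payload.get("runId", run_id) != run_id:
            raise ValueError(f"{kind} does not belong to run {run_id}")
    return "".join(f["textMessageContent"]["delta"] for f in frames if "textMessageContent" in f)


rid = "55d3a9e55a624e9cb12482030c1a95f4"
observed = [
    {"runStarted": {"threadId": "Role:96eaf780", "runId": rid}},
    {"textMessageStart": {"messageId": rid, "role": "assistant"}},
    {"textMessageContent": {"messageId": rid, "delta": "pong"}},
    {"textMessageEnd": {"messageId": rid}},
    {"runFinished": {"threadId": "Role:96eaf780", "runId": rid}},
]
print(check_success_stream(observed))
```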

louis4li · 2026-05-20T08:31:11Z

Additional local reproduction and fix for problem 1:

Reproduction: through the Studio member API, create a member, then call the binding endpoint with the GAgent binding parameters from the issue:

PUT /api/scopes/scope-issue736-studio/members/m-9833881c18e14c19aab60b2b9c7e998f/binding

The endpoint in the request body does not include responseTypeUrl:

{
  "implementationKind": "gagent",
  "displayName": "m-9833881c18e14c19aab60b2b9c7e998f",
  "gagent": {
    "actorTypeName": "Aevatar.Studio.Hosting.Endpoints.ScriptGenerateGAgent, Aevatar.Studio.Hosting",
    "endpoints": [
      {
        "endpointId": "run",
        "displayName": "Run",
        "kind": "command",
        "requestTypeUrl": "type.googleapis.com/google.protobuf.StringValue",
        "description": "You are the team member gagent. Own long-lived state and answer through the selected tools."
      }
    ]
  }
}

Before the fix, the request returned 500:

System.ArgumentNullException: Value cannot be null. (Parameter 'value')
   at ...StudioMemberGAgentEndpointBindingRequest.set_ResponseTypeUrl(String value)
   at ...ActorDispatchStudioMemberCommandService.BuildBindingRequest(...)

Root cause: responseTypeUrl should not be required for a command endpoint. When the JSON field is missing, it binds to null, and the backend passed that null straight to the protobuf string setter; protobuf rejects null, so the request failed with a 500.

Fix: at the Studio member command mapping boundary, the GAgent endpoint's protobuf string fields are normalized from null to empty; in particular, a missing responseTypeUrl is mapped to an empty string.

Local retest of the same request after the fix:

HTTP/1.1 202 Accepted
{"status":"accepted","bindingRunId":"bind-bb19cee2f4e44bc4bd50f412003b55a0","scopeId":"scope-issue736-studio-fixed","memberId":"m-9833881c18e14c19aab60b2b9c7e998f"}

Then querying the binding run:

{"bindingRunId":"bind-bb19cee2f4e44bc4bd50f412003b55a0","scopeId":"scope-issue736-studio-fixed","memberId":"m-9833881c18e14c19aab60b2b9c7e998f","status":"succeeded","failure":null,"platformBindingCommandId":"platform-bind-bb19cee2f4e44bc4bd50f412003b55a0-1"}

Validation:
- dotnet test test/Aevatar.Studio.Tests/Aevatar.Studio.Tests.csproj --nologo --filter "FullyQualifiedName~ActorDispatchStudioMemberCommandServiceTests": Passed 13
- bash tools/ci/test_stability_guards.sh: Passed
- bash tools/ci/query_projection_priming_guard.sh: Passed
- git diff --check: Passed

Commit: 4d521c29 Default missing Studio GAgent response type
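The null-to-empty normalization described in the fix can be illustrated with a small Python model. `EndpointBinding` is a hypothetical stand-in for the generated protobuf type (whose C# string setters reject null, as the stack trace shows); `build_binding` and the field list are illustrative, not the repository's mapping code.

```python
class EndpointBinding:
    """Stand-in for a protobuf message whose string fields reject None."""

    STRING_FIELDS = ("endpointId", "displayName", "kind", "requestTypeUrl", "responseTypeUrl", "description")

    def __setattr__(self, name: str, value) -> None:
        if name in self.STRING_FIELDS and value is None:
            raise TypeError(f"Value cannot be null. (Parameter '{name}')")
        super().__setattr__(name, value)


def build_binding(endpoint: dict) -> EndpointBinding:
    """Map a JSON endpoint, turning missing or null string fields into ''."""
    binding = EndpointBinding()
    for name in EndpointBinding.STRING_FIELDS:
        value = endpoint.get(name)
        setattr(binding, name, "" if value is None else value)
    return binding


request_endpoint = {
    "endpointId": "run",
    "displayName": "Run",
    "kind": "command",
    "requestTypeUrl": "type.googleapis.com/google.protobuf.StringValue",
}
binding = build_binding(request_endpoint)
print(repr(binding.responseTypeUrl))
```

Normalizing at the mapping boundary keeps the "optional in JSON, non-null in protobuf" mismatch out of every downstream caller.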


louis4li · 2026-05-20T09:20:34Z

Follow-up pushed in 72bb0330.

What changed:

RoleChatSessionCompletedEvent.ContentEmitted=true now only synthesizes the missing terminal frames (textMessageEnd + runFinished) instead of replaying textMessageStart/textMessageContent.
Kept the ContentEmitted=false fallback path so committed completion can still reconstruct full content frames when live content was not emitted.
Split projector coverage to assert both paths.
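The ContentEmitted rule can be expressed as one branch over the frames synthesized for a successful committed completion. A hypothetical Python sketch: `completion_frames` is an illustrative name, and `messageId` reuses the runId as in the observed stream.

```python
def completion_frames(run_id: str, thread_id: str, content: str, content_emitted: bool) -> list:
    """Frames to synthesize from a successful committed completion."""
    tail = [
        {"textMessageEnd": {"messageId": run_id}},
        {"runFinished": {"threadId": thread_id, "runId": run_id}},
    ]
    if content_emitted:
        # Live content already reached the client: only close the message and the run.
        return tail
    # Fallback: live content was never emitted, so reconstruct the whole message.
    head = [
        {"textMessageStart": {"messageId": run_id, "role": "assistant"}},
        {"textMessageContent": {"messageId": run_id, "delta": content}},
    ]
    return head + tail
```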

Validation:

dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeServiceEndpointsStreamTests" — passed 18/18
git diff --check — passed

jason-aelf · 2026-05-21T02:59:13Z

            if (onAcceptedAsync != null)
                await onAcceptedAsync(receipt, ct);

+            await _dispatchPipeline.DispatchPreparedAsync(execution, ct);


This moves accepted/runStarted before dispatch. For GAgent draft-run, onAcceptedAsync starts the SSE response and sets ResponseStarted to true. If DispatchPreparedAsync or the actor handler throws afterward, the endpoint exception path skips prepared actor rollback because ResponseStarted == true.

This affects draft-run requests that create a new actor. GAgentDraftRunActorPreparationService marks newly created actors with RequiresRollbackOnFailure: true, but after early runStarted, the failure path may no longer unregister/destroy that temporary actor, leaving the actor and registry entry behind. Reusing an existing actor is not affected.

Good catch — this was a real issue. I pushed dbafc05 to address it.

What changed:

draft-run exception/timeout/client-disconnect paths now call RollbackPreparedActorAsync even after the SSE response has started

rollback remains gated by RequiresRollbackOnFailure, so existing actor reuse is not affected

added coverage for the case where runStarted is emitted first and a later failure still rolls back the prepared temporary actor
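The rollback change amounts to dropping the ResponseStarted condition from the cleanup decision. A hypothetical Python model: `PreparedActor`, `ActorRegistry` and both functions are illustrative, with `requires_rollback_on_failure` mirroring RequiresRollbackOnFailure from the review.

```python
from dataclasses import dataclass, field


@dataclass
class PreparedActor:
    actor_id: str
    # True only when the actor was created for this draft run (not reused).
    requires_rollback_on_failure: bool


@dataclass
class ActorRegistry:
    actor_ids: set = field(default_factory=set)

    def rollback(self, prepared: PreparedActor) -> None:
        # Stand-in for unregistering/destroying the temporary actor.
        self.actor_ids.discard(prepared.actor_id)


def rollback_before_fix(registry: ActorRegistry, prepared: PreparedActor, response_started: bool) -> None:
    # Old behaviour: once runStarted had been streamed, rollback was skipped.
    if not response_started and prepared.requires_rollback_on_failure:
        registry.rollback(prepared)


def rollback_after_fix(registry: ActorRegistry, prepared: PreparedActor, response_started: bool) -> None:
    # New behaviour: response_started no longer matters; only the flag does.
    if prepared.requires_rollback_on_failure:
        registry.rollback(prepared)
```

Reused actors carry `requires_rollback_on_failure=False`, so they survive a failure under both versions.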

Validation:

dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --no-restore --filter "FullyQualifiedName~ScopeServiceEndpointsStreamTests" — passed 19/19

git diff --check — passed




Fix draft run SSE accepted timing

fe04faa

louis4li requested a review from jason-aelf as a code owner May 20, 2026 07:21

chatgpt-codex-connector Bot reviewed May 20, 2026


Propagate draft run terminal events

2ea23ec

louis4li requested a review from eanzhao as a code owner May 20, 2026 07:41

Preserve draft run terminal completion content

210909d

Default missing Studio GAgent response type

4d521c2

louis4li and others added 2 commits May 20, 2026 16:35

Merge branch 'dev' into fix/2026-05-20_gagent-member-backend-issues

8756b88

Avoid replaying emitted draft-run content

72bb033

Respect the committed completion content-emitted flag so recovery only synthesizes terminal frames after live content has already been streamed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jason-aelf reviewed May 21, 2026


louis4li and others added 8 commits May 21, 2026 11:36

Rollback draft-run actors after streamed failures

dbafc05

Ensure temporary draft-run actors are cleaned up even when an accepted SSE frame has already been sent and a later dispatch or execution failure occurs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'dev' into fix/2026-05-20_gagent-member-backend-issues

6792bd6

Cover draft-run projector branches

7ceecf5

Add focused stream projector tests for ignored envelopes, live terminal frame synthesis, runFinished id completion, and committed completion failure/empty-content paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/fix/2026-05-20_gagent-member-bac…

7193333

…kend-issues' into fix/2026-05-20_gagent-member-backend-issues

Add issue 736 integration guards

7f13c23

Fix studio generate draft-run chat responses

0fe1b49

Fix draft-run normal content streaming

fb427ac

Fix draft-run completion content fallback

3efd844

louis4li closed this May 22, 2026

louis4li reopened this May 22, 2026

louis4li wants to merge 14 commits into dev from fix/2026-05-20_gagent-member-backend-issues


3 participants
