Fix draft-run SSE accepted timing #740

Open
louis4li wants to merge 14 commits into dev from fix/2026-05-20_gagent-member-backend-issues

@louis4li louis4li commented May 20, 2026

Summary

  • emit the draft-run accepted/runStarted frame after command preparation and before awaiting prepared dispatch
  • keep stream pumping after dispatch admission so long-running actor/LLM execution no longer leaves SSE clients with an open zero-frame response
  • map committed RoleChatSessionCompletedEvent terminal facts through the draft-run projection session so SSE receives real downstream terminal frames
  • return terminal provider/LLM failures as AG-UI runError frames instead of falling back to a generic endpoint timeout
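
The ordering described in the summary can be sketched roughly as follows. This is a simplified Python model, not the PR's C# code: `run_draft`, `prepare`, `dispatch`, and the frame dictionaries are hypothetical stand-ins chosen for illustration. It only shows that the accepted frame goes out once preparation succeeds, and that a later dispatch failure surfaces as a `runError` frame rather than a pre-stream error.

```python
import asyncio


async def run_draft(prepare, dispatch, frames):
    """Emit runStarted after preparation, then dispatch; later failures become runError frames."""
    # If prepare() raises, no frame has been sent yet, so a normal start error is still possible.
    receipt = await prepare()
    # Accepted frame: sent before awaiting the potentially long dispatch.
    frames.append({"runStarted": {"runId": receipt["runId"]}})
    try:
        await dispatch(receipt)
    except Exception as exc:
        # The stream is already open, so the failure is reported as a stream frame.
        frames.append({"runError": {"message": str(exc)}})
    return frames


async def _prepare_ok():
    return {"runId": "run-1"}


async def _dispatch_fails(receipt):
    raise RuntimeError("provider unavailable")


async def _dispatch_ok(receipt):
    await asyncio.sleep(0)


failed = asyncio.run(run_draft(_prepare_ok, _dispatch_fails, []))
succeeded = asyncio.run(run_draft(_prepare_ok, _dispatch_ok, []))
print(failed)
print(succeeded)
```

In the successful case the sketch stops after `runStarted`; in the PR the remaining frames arrive through the draft-run projection session, which this model does not cover.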

Issue #736 scope

This PR fixes issue #736 problem 2. I also tried to reproduce problem 1 locally in distributed mode, including the PR #702 GAgent member payload, but the member binding path returned 202 and the binding run succeeded; missing members were asynchronously rejected with STUDIO_MEMBER_NOT_FOUND. No 500 was reproduced locally, so this PR intentionally does not claim to fix problem 1.

Verification

  • dotnet test test/Aevatar.Workflow.Application.Tests/Aevatar.Workflow.Application.Tests.csproj --nologo --filter "FullyQualifiedName~WorkflowApplicationLayerTests"
  • dotnet test test/Aevatar.Studio.Tests/Aevatar.Studio.Tests.csproj --nologo --filter "FullyQualifiedName~StudioMemberEndpointsTests|FullyQualifiedName~ScopeBindingStudioMemberPlatformBindingCommandServiceTests"
  • dotnet test test/Aevatar.GAgentService.Tests/Aevatar.GAgentService.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeGAgentAguiEventMapperTests|FullyQualifiedName~GAgentDraftRunInteractionCoverageTests"
  • dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeServiceEndpointsStreamTests|FullyQualifiedName~ScopeGAgentEndpointsTests"
  • bash tools/ci/test_stability_guards.sh
  • bash tools/ci/query_projection_priming_guard.sh
  • bash tools/ci/projection_state_version_guard.sh
  • bash tools/ci/projection_state_mirror_current_state_guard.sh
  • dotnet build aevatar.slnx --nologo

Local runtime check

Started Mainnet Host in distributed mode against local Kafka/Garnet/Elasticsearch with Orleans/Kafka/Garnet runtime configuration and called POST /api/scopes/{scopeId}/gagent/draft-run. Neo4j still reported an auth failure in Development and was ignored by startup, matching the earlier local setup.

The SSE response now produces a real downstream terminal event, not just an initial runStarted frame and not a generic timeout. With no local NyxID auth token, the observed frames were:

```text
data: { "runStarted": { "threadId": "Role:ce480c92", "runId": "27699f500f124d3c877d9db5f2ce20a4" } }

data: { "runError": { "message": "LLM request failed [tools=...]: NyxID authentication required for provider 'nyxid'. Please sign in." } }
```

So the local unauthenticated environment cannot prove a successful `pong` text response, but it does prove the actor/LLM path reaches a terminal provider result and that the SSE chain returns that terminal data promptly instead of hanging until timeout.
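
A check like the one above can be scripted. The following is a minimal sketch, assuming each frame arrives as a single `data:` line carrying one JSON object, and assuming `runFinished` and `runError` are the only terminal frame kinds. `parse_sse_frames` and `terminal_frame` are illustrative helper names, not part of the codebase.

```python
import json

# Assumption: these two frame kinds end a run.
TERMINAL_KEYS = ("runFinished", "runError")


def parse_sse_frames(body: str):
    """Return the JSON payload of every `data:` line in an SSE response body."""
    frames = []
    for line in body.splitlines():
        if line.startswith("data:"):
            frames.append(json.loads(line[len("data:"):].strip()))
    return frames


def terminal_frame(frames):
    """Return the first terminal frame, or None if the stream never terminated."""
    for frame in frames:
        if any(key in frame for key in TERMINAL_KEYS):
            return frame
    return None


sample = """data: { "runStarted": { "threadId": "Role:ce480c92", "runId": "27699f500f124d3c877d9db5f2ce20a4" } }

data: { "runError": { "message": "LLM request failed [tools=...]: NyxID authentication required for provider 'nyxid'. Please sign in." } }
"""

frames = parse_sse_frames(sample)
print(terminal_frame(frames))
```

A response that contains only `runStarted` would make `terminal_frame` return `None`, which corresponds to the hanging behavior this PR is meant to remove.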

@louis4li louis4li requested a review from jason-aelf as a code owner May 20, 2026 07:21

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe04faa564

Comment on lines 61 to +64
if (onAcceptedAsync != null)
    await onAcceptedAsync(receipt, ct);

await _dispatchPipeline.DispatchPreparedAsync(execution, ct);

P1: Emit accepted only after dispatch admission

onAcceptedAsync now runs before DispatchPreparedAsync, so SSE/WebSocket clients can receive a run-started/accepted ack even when dispatch later fails (or is canceled) and the command never reaches the actor inbox. In HandleChat, this also starts the stream early, so failures after that point can no longer be returned as normal start errors and become stream-error frames instead, leaving clients with a false positive “accepted” state.

Contributor Author

Thanks for calling this out. For this endpoint the early frame is intentional: after PrepareAsync succeeds we have a stable command/run id and can start the SSE stream before the potentially long actor/LLM dispatch path. Moving onAcceptedAsync back after DispatchPreparedAsync would reintroduce the zero-frame/late-accepted behavior this PR is fixing. If DispatchPreparedAsync fails after that point, the honest outcome should be a stream error/terminal frame rather than a pre-stream start error. I also rechecked the follow-up projector change so committed completions no longer replay content when ContentEmitted=true; they only synthesize the missing terminal frames.

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 92.99065% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.52%. Comparing base (ab91b35) to head (3efd844).
⚠️ Report is 17 commits behind head on dev.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/Aevatar.AI.Core/RoleGAgent.cs | 58.82% | 6 Missing and 1 partial ⚠️ |
| ...actions/ScopeGAgents/ScopeGAgentAguiEventMapper.cs | 88.63% | 2 Missing and 3 partials ⚠️ |
| .../Projectors/GAgentDraftRunSessionEventProjector.cs | 98.58% | 0 Missing and 2 partials ⚠️ |
| ...tService.Hosting/Endpoints/ScopeGAgentEndpoints.cs | 66.66% | 1 Missing ⚠️ |

@@            Coverage Diff             @@
##              dev     #740      +/-   ##
==========================================
+ Coverage   82.48%   82.52%   +0.03%     
==========================================
  Files         941      941              
  Lines       60101    60270     +169     
  Branches     7872     7890      +18     
==========================================
+ Hits        49575    49737     +162     
- Misses       7131     7135       +4     
- Partials     3395     3398       +3     

| Flag | Coverage Δ |
|---|---|
| ci | 82.52% <92.99%> (+0.03%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| ...e/Interactions/DefaultCommandInteractionService.cs | 80.39% <100.00%> (+0.19%) ⬆️ |
| ...ervices/ActorDispatchStudioMemberCommandService.cs | 92.65% <100.00%> (+0.40%) ⬆️ |
| ...tService.Hosting/Endpoints/ScopeGAgentEndpoints.cs | 83.26% <66.66%> (+0.03%) ⬆️ |
| .../Projectors/GAgentDraftRunSessionEventProjector.cs | 98.14% <98.58%> (+21.95%) ⬆️ |
| ...actions/ScopeGAgents/ScopeGAgentAguiEventMapper.cs | 87.09% <88.63%> (-2.12%) ⬇️ |
| src/Aevatar.AI.Core/RoleGAgent.cs | 79.05% <58.82%> (-0.44%) ⬇️ |

... and 1 file with indirect coverage changes

@louis4li louis4li requested a review from eanzhao as a code owner May 20, 2026 07:41
@louis4li
Contributor Author

Review/runtime follow-up:

  • Re-reviewed the projector/mapper split after the terminal completion change. The mapper now stays as single-envelope -> single AG-UI frame translation; committed RoleChatSessionCompletedEvent expansion lives in GAgentDraftRunSessionEventProjector because success completion can produce TextMessageStart + TextMessageContent + TextMessageEnd + RunFinished.
  • Verified locally in distributed mode on http://127.0.0.1:5100 with Kafka/Garnet/Elasticsearch and Neo4j configured as neo4j/Password. Neo4j auth still fails in Development and is ignored; the document projection startup probe passes.
  • Runtime SSE check for /api/scopes/scope-issue736-chain/gagent/draft-run returned runStarted followed by the real downstream runError: NyxID authentication required for provider nyxid. This confirms the issue-2 chain now returns actual terminal data through the projector/session stream instead of the previous generic timeout/500 path.
  • Successful terminal content preservation is covered by ScopeServiceEndpointsStreamTests, including ContentEmitted=true producing TextMessageContent delta=pong before RunFinished.

Validation run:

  • dotnet test test/Aevatar.GAgentService.Tests/Aevatar.GAgentService.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeGAgentAguiEventMapperTests"
  • dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeServiceEndpointsStreamTests"
  • bash tools/ci/test_stability_guards.sh
  • bash tools/ci/query_projection_priming_guard.sh
  • bash tools/ci/projection_state_version_guard.sh
  • bash tools/ci/projection_state_mirror_current_state_guard.sh
  • git diff --check

@louis4li
Contributor Author

Adding measured results for the success path: this time I did not test only the failure path. Under the local distributed Host, I pointed the LLM provider at a local OpenAI-compatible mock server and called the draft-run SSE endpoint again.

Request:

```bash
curl -sS -N --max-time 35 -X POST \
  http://127.0.0.1:5100/api/scopes/scope-issue736-success-rerun/gagent/draft-run \
  -H 'Content-Type: application/json' \
  -H 'Accept: text/event-stream' \
  -d '{"actorTypeName":"Aevatar.AI.Core.RoleGAgent, Aevatar.AI.Core","prompt":"Please reply with exactly: pong","timeoutMs":10000}'
```

The actual response includes normal content frames:

```text
data: { "runStarted": { "threadId": "Role:96eaf780", "runId": "55d3a9e55a624e9cb12482030c1a95f4" } }

data: { "textMessageStart": { "messageId": "55d3a9e55a624e9cb12482030c1a95f4", "role": "assistant" } }

data: { "textMessageContent": { "messageId": "55d3a9e55a624e9cb12482030c1a95f4", "delta": "pong" } }

data: { "textMessageEnd": { "messageId": "55d3a9e55a624e9cb12482030c1a95f4" } }

data: { "runFinished": { "threadId": "Role:96eaf780", "runId": "55d3a9e55a624e9cb12482030c1a95f4" } }
```

Conclusion: on the success path the projector expands the committed completion into start/content/end/finished frames, and the content frame does carry data (delta=pong). The failure path is not the only one that is reachable.

@louis4li
Contributor Author

Adding the local reproduction and fix for problem 1:

Reproduction: go through the Studio member API. After creating a member, call the binding endpoint with the GAgent binding parameters from the issue:

PUT /api/scopes/scope-issue736-studio/members/m-9833881c18e14c19aab60b2b9c7e998f/binding

The endpoint in the request body does not include responseTypeUrl:

```json
{
  "implementationKind": "gagent",
  "displayName": "m-9833881c18e14c19aab60b2b9c7e998f",
  "gagent": {
    "actorTypeName": "Aevatar.Studio.Hosting.Endpoints.ScriptGenerateGAgent, Aevatar.Studio.Hosting",
    "endpoints": [
      {
        "endpointId": "run",
        "displayName": "Run",
        "kind": "command",
        "requestTypeUrl": "type.googleapis.com/google.protobuf.StringValue",
        "description": "You are the team member gagent. Own long-lived state and answer through the selected tools."
      }
    ]
  }
}
```

Before the fix, this request returned 500:

```text
System.ArgumentNullException: Value cannot be null. (Parameter 'value')
   at ...StudioMemberGAgentEndpointBindingRequest.set_ResponseTypeUrl(String value)
   at ...ActorDispatchStudioMemberCommandService.BuildBindingRequest(...)
```

Root cause: responseTypeUrl should not be required for a command endpoint. When the field is missing from the JSON it binds to null, and the backend writes that null directly into the protobuf string setter. Protobuf rejects null and throws, which surfaces as the 500.

Fix: at the Studio member command mapping boundary, normalize the protobuf string fields of the GAgent endpoint from null to empty. In particular, a missing responseTypeUrl is mapped to an empty string.

After the fix, the same request retested locally returns:

```text
HTTP/1.1 202 Accepted
{"status":"accepted","bindingRunId":"bind-bb19cee2f4e44bc4bd50f412003b55a0","scopeId":"scope-issue736-studio-fixed","memberId":"m-9833881c18e14c19aab60b2b9c7e998f"}
```

Querying the binding run afterwards:

```json
{"bindingRunId":"bind-bb19cee2f4e44bc4bd50f412003b55a0","scopeId":"scope-issue736-studio-fixed","memberId":"m-9833881c18e14c19aab60b2b9c7e998f","status":"succeeded","failure":null,"platformBindingCommandId":"platform-bind-bb19cee2f4e44bc4bd50f412003b55a0-1"}
```

Verification:
- dotnet test test/Aevatar.Studio.Tests/Aevatar.Studio.Tests.csproj --nologo --filter "FullyQualifiedName~ActorDispatchStudioMemberCommandServiceTests": Passed 13
- bash tools/ci/test_stability_guards.sh: Passed
- bash tools/ci/query_projection_priming_guard.sh: Passed
- git diff --check: Passed

Commit: 4d521c29 Default missing Studio GAgent response type
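
The null-to-empty normalization can be illustrated with a small sketch. It is written in Python rather than the project's C#, and `EndpointBindingRequest` and `build_binding_request` are hypothetical stand-ins: the class only imitates a protobuf-style string setter that rejects null, and the real mapping code may differ.

```python
class EndpointBindingRequest:
    """Hypothetical stand-in for a protobuf message whose string setter rejects None."""

    def __init__(self):
        self._response_type_url = ""

    @property
    def response_type_url(self):
        return self._response_type_url

    @response_type_url.setter
    def response_type_url(self, value):
        if value is None:
            raise ValueError("Value cannot be null. (Parameter 'value')")
        self._response_type_url = value


def build_binding_request(endpoint: dict) -> EndpointBindingRequest:
    """Map a JSON endpoint to the request, normalizing a missing or null field to ''."""
    request = EndpointBindingRequest()
    # Normalize at the mapping boundary so the setter never sees None.
    request.response_type_url = endpoint.get("responseTypeUrl") or ""
    return request


endpoint = {
    "endpointId": "run",
    "kind": "command",
    "requestTypeUrl": "type.googleapis.com/google.protobuf.StringValue",
}
request = build_binding_request(endpoint)
print(repr(request.response_type_url))
```

Normalizing at the boundary keeps the setter strict while still accepting payloads that omit the optional field.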

louis4li and others added 2 commits May 20, 2026 16:35
Respect the committed completion content-emitted flag so recovery only synthesizes terminal frames after live content has already been streamed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@louis4li
Contributor Author

Follow-up pushed in 72bb0330.

What changed:

  • RoleChatSessionCompletedEvent.ContentEmitted=true now only synthesizes the missing terminal frames (textMessageEnd + runFinished) instead of replaying textMessageStart/textMessageContent.
  • Kept the ContentEmitted=false fallback path so committed completion can still reconstruct full content frames when live content was not emitted.
  • Split projector coverage to assert both paths.
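
The two paths can be summarized in a short sketch. This is a simplified Python model of the behavior described above, not the projector's actual code: `expand_committed_completion` is a hypothetical name, the frame shapes follow the SSE output quoted earlier in this thread (where `messageId` matches `runId`), and failure or empty-content completions are not modeled.

```python
def expand_committed_completion(run_id, thread_id, content, content_emitted):
    """Return the frames synthesized for a committed successful completion."""
    frames = []
    if not content_emitted:
        # Fallback path: live content never reached the client, so rebuild it.
        frames.append({"textMessageStart": {"messageId": run_id, "role": "assistant"}})
        frames.append({"textMessageContent": {"messageId": run_id, "delta": content}})
    # Both paths close the message and the run.
    frames.append({"textMessageEnd": {"messageId": run_id}})
    frames.append({"runFinished": {"threadId": thread_id, "runId": run_id}})
    return frames


def frame_kinds(frames):
    """Return the single top-level key of each frame, in order."""
    return [next(iter(frame)) for frame in frames]


already_streamed = expand_committed_completion("run-1", "Role:abc", "pong", content_emitted=True)
not_streamed = expand_committed_completion("run-1", "Role:abc", "pong", content_emitted=False)
print(frame_kinds(already_streamed))
print(frame_kinds(not_streamed))
```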

Validation:

  • dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --filter "FullyQualifiedName~ScopeServiceEndpointsStreamTests" — passed 18/18
  • git diff --check — passed

if (onAcceptedAsync != null)
    await onAcceptedAsync(receipt, ct);

await _dispatchPipeline.DispatchPreparedAsync(execution, ct);
Collaborator

This moves accepted/runStarted before dispatch. For GAgent draft-run, onAcceptedAsync starts the SSE response and sets ResponseStarted to true. If DispatchPreparedAsync or the actor handler throws afterward, the endpoint exception path skips prepared actor rollback because ResponseStarted == true.

This affects draft-run requests that create a new actor. GAgentDraftRunActorPreparationService marks newly created actors with RequiresRollbackOnFailure: true, but after early runStarted, the failure path may no longer unregister/destroy that temporary actor, leaving the actor and registry entry behind. Reusing an existing actor is not affected.

Contributor Author

Good catch — this was a real issue. I pushed dbafc05 to address it.

What changed:

  • draft-run exception/timeout/client-disconnect paths now call RollbackPreparedActorAsync even after the SSE response has started
  • rollback remains gated by RequiresRollbackOnFailure, so existing actor reuse is not affected
  • added coverage for the case where runStarted is emitted first and a later failure still rolls back the prepared temporary actor
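
A rough model of the gating, for readers following the thread. This is a Python sketch, not the endpoint's C# code: `handle_draft_run`, the `prepared` dictionary, and its keys are hypothetical. It covers only the exception path; timeouts and client disconnects are left out.

```python
import asyncio


async def handle_draft_run(prepared, execute, rollback, frames):
    """Start the stream, run the command, and roll back a temporary actor on failure."""
    frames.append("runStarted")  # the SSE response has started from here on
    try:
        await execute()
    except Exception as exc:
        frames.append(f"runError: {exc}")
        # Rollback does not depend on whether the response has started;
        # it is gated only by the flag set during actor preparation.
        if prepared["requires_rollback_on_failure"]:
            await rollback(prepared["actor_id"])


async def _failing_execute():
    raise RuntimeError("dispatch failed")


rolled_back = []


async def _rollback(actor_id):
    rolled_back.append(actor_id)


new_actor = {"actor_id": "temp-actor", "requires_rollback_on_failure": True}
reused_actor = {"actor_id": "existing-actor", "requires_rollback_on_failure": False}

new_frames, reused_frames = [], []
asyncio.run(handle_draft_run(new_actor, _failing_execute, _rollback, new_frames))
asyncio.run(handle_draft_run(reused_actor, _failing_execute, _rollback, reused_frames))
print(rolled_back)
```

Only the newly created actor is rolled back; the reused actor is left in place even though both runs fail after `runStarted`.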

Validation:

  • dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --no-restore --filter "FullyQualifiedName~ScopeServiceEndpointsStreamTests" — passed 19/19
  • git diff --check — passed

louis4li and others added 8 commits May 21, 2026 11:36
Ensure temporary draft-run actors are cleaned up even when an accepted SSE frame has already been sent and a later dispatch or execution failure occurs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add focused stream projector tests for ignored envelopes, live terminal frame synthesis, runFinished id completion, and committed completion failure/empty-content paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…kend-issues' into fix/2026-05-20_gagent-member-backend-issues
@louis4li louis4li closed this May 22, 2026
@louis4li louis4li reopened this May 22, 2026

3 participants