demo(agentserver): TEMPORARY - durable-agent-demo (split out of #46997, never-merged) by RaviPidaparthi · Pull Request #47276 · Azure/azure-sdk-for-python

RaviPidaparthi · 2026-06-02T03:16:30Z

Summary — TEMPORARY / DO NOT MERGE

This is the durable-agent-demo split out of the original spec 016
durability PR (#46997). It carries the azd-deployable hosted-agent
demo (34 files: bicep infra, .azure azd state, src/durable-research-agent
agent code, build/demo-client scripts).

Status

🚨 This PR is not intended for merge. The demo lives here purely so
it isn't lost from the working set; we use it temporarily as a
reference deployment while the durable-task primitive matures.

Scope

sdk/agentserver/azure-ai-agentserver-invocations/samples/durable-agent-demo/
only (34 files), plus whatever else came from the original split-point
branch; see the next section for the cleanup needed.

What this branch needs before any potential reuse

Filter the diff to ONLY the durable-agent-demo/ directory. Everything
else should be discarded by reverting those paths to origin/main:
the core+invocations work belongs in PR #46997 (feat(agentserver): light up durable-task primitive (core 2.0.0b6 + invocations 1.0.0b5)),
and the responses work belongs in PR #47275 (feat(agentserver): WIP - responses package durable orchestration (split out of #46997)).

Pointers

The distilled invocations sample (samples/durable_research)
derived from this demo is shipping in PR #46997 (feat(agentserver): light up durable-task primitive (core 2.0.0b6 + invocations 1.0.0b5)).
The demo here remains as the fuller azd-deployable reference.

This commit restores the azd-deployable durable-agent-demo (34 files) that was moved out of the core PR (#46997) to keep scope manageable. Sits on top of the core PR branch so it only shows the demo delta.

🚨 TEMPORARY: this PR is NOT intended for merge. The demo lives here purely so it isn't lost from the working set; we use it as a reference deployment while the durable-task primitive matures. The distilled invocations sample (samples/durable_research) derived from this demo ships in PR #46997 instead.

Restored from safety-spec016-backup-2026-06-02 (SHA 3df9c5b).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…rver-durable-agent-demo

…16 core

Three call-site updates to align the demo with the spec 016 public surface:

1. Drop TaskTerminated from imports (it was removed from the public surface; TaskCancelled now covers cooperative-cancel paths).
2. Drop session_id= from deep_research.start(): session is platform-derived from FOUNDRY_AGENT_SESSION_ID, not a per-call argument.
3. Await deep_research.get_active_run(task_id): it's now an async method (the framework needs to consult the task store, not just in-memory state), so the previous synchronous call returned a coroutine, not a TaskRun.

Also refreshes the bundled wheels (b4 -> core 2.0.0b6 + invocations 1.0.0b5) and the azd env state from a fresh 'azd up' deployment against the e2e-tests-westus2 Foundry project.

Verified end-to-end against the deployed agent:

  ./demo-client.sh start "durable tasks demonstration"
    -> streams stages 1/12, 2/12, ... live via SSE
  ./demo-client.sh crash
    -> {"status":"crashing"}; supervisor restarts the container

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
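To illustrate item 3, here is a minimal sketch of why a missing await breaks the call site. FakeTask is a hypothetical stand-in for the demo's deep_research object, not the agentserver API; only the method name get_active_run comes from the commit message.

```python
import asyncio
import inspect


class FakeTask:
    """Hypothetical stand-in for the demo's deep_research task object."""

    async def get_active_run(self, task_id):
        # Stands in for the framework consulting the task store.
        await asyncio.sleep(0)
        return {"task_id": task_id, "status": "running"}


async def main():
    task = FakeTask()

    # Old call shape: no await, so we get a coroutine object, not a run.
    not_awaited = task.get_active_run("t-1")
    is_coroutine = inspect.iscoroutine(not_awaited)
    not_awaited.close()  # avoid a "coroutine was never awaited" warning

    # New call shape: await the async method to get the actual run.
    run = await task.get_active_run("t-1")
    return is_coroutine, run


is_coroutine, run = asyncio.run(main())
```

The un-awaited call is truthy and has no run attributes, which is why the bug can go unnoticed until something reads a field from it.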

@task

…story

The platform now provides two capabilities that obsolete our application-level infrastructure:

* nanny worker restarts the container within ~5-10 min of a crash
* lease-renewal on @task internally pings /readiness so the sandbox stays alive as long as a durable task is executing (no need for client traffic)

This commit rewrites the durable-research-agent demo around those guarantees and adds steering as a third headline capability.

Removed:

* supervisor.py (170 lines): the PID-1 reverse proxy + restart loop
* entrypoint.sh: the auto-restart bash wrapper
* aiohttp production dep (only supervisor used it)
* Dockerfile FOUNDRY_TASK_API_ENABLED (auto-selected now per dev guide)
* Multi-shape input parsing (per feedback: it's a demo; stick to one shape)

Container now runs 'python app.py' directly; CMD changed accordingly and azure.yaml startupCommand updated to match.

agent.py rewrite (~221 lines):

* 15 phases x 4 LLM sub-calls (research -> critique -> refine -> synthesize) targeting ~45 min total wall time (gpt-4.1-mini, 1500 output tokens/call). Env-overridable: NUM_PHASES, CALLS_PER_PHASE, TARGET_OUTPUT_TOKENS, INTRA_PHASE_COOLDOWN_SEC, INTER_PHASE_COOLDOWN_SEC.
* Every phase emits phase_start + phase_end events with server_time_utc (UTC ISO8601 with ms) and server_uptime_sec. The uptime resets to ~0 after the platform nanny restarts the container, so a viewer can SEE the crash recovery happen in the stream.
* @task(steerable=True). On every checkpoint boundary the handler checks ctx.cancel.is_set() and (when pending_input_count > 0) emits a winding_down event with cause + returns ctx.suspend(). The framework drains the next steering input as a fresh turn.
* Topic-change detection at handler entry resets checkpoint state when the steered topic differs from the previously stored one.

app.py rewrite:

* task_id = session_id (was invocation_id) so steering routes correctly: second POST on the same session hits the active task and queues input.
* POST /invocations with {"message": "crash"} (DEMO_MODE=1) exits the process so the platform nanny restarts the container. The platform only proxies /invocations*, so we can't add custom routes.
* GET /invocations/{id} falls back to file replay when no live run is present, so reconnecting after the task completes still shows the full transcript (regression fix per review feedback).
* session_id read from app.config.session_id when not on request.state (GET state doesn't carry it).

demo-client.sh rewrite:

* Pretty SSE renderer that recognises the new event types (run_start, recovered, phase_start/end, subcall_start/end, winding_down, run_complete) and box-prints the timestamps.
* Commands: start, stream (reconnect), steer, crash, cancel, status, logs, reset. No more auto-reconnect spam: disconnects suggest manual reconnect, matching the long-run / no-ingress demo flow.
* Three-terminal demo workflow documented in --help and README.

README rewrite:

* Documents the three capabilities (long-running > 15 min, crash recovery via platform nanny ~5-10 min, steering).
* New A/B/C demo walkthroughs (long-run no-ingress, crash-recovery, steering).
* Architecture diagram drops supervisor; lists the platform-managed behaviors (nanny, lease renewal) explicitly.
* Env var table reflects new tuning knobs.

Verified end-to-end against the deployed agent (e2e-tests-westus2 / durable-research-agent):

* Streaming with timestamps -> all event types render correctly
* Steering -> 'Steering drain: task drained next input' in server logs; new turn runs new topic
* GET after completion -> HTTP 200 + SSE file replay
* Crash dispatch -> POST returns 202; next POST gets 424 (container down, awaiting nanny restart)

Refreshed wheels (core 2.0.0b6 + invocations 1.0.0b5) and azd env state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
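The timestamped phase events described above could be built roughly as follows. This is a sketch, not the demo's agent.py: the helper name phase_event and the "phase"/"total" fields are assumptions, and only server_time_utc, server_uptime_sec and the phase_start/phase_end type names come from the commit message.

```python
import json
import time
from datetime import datetime, timezone

# Captured once at import; a restarted container starts again near zero,
# which is what makes a recovery visible in the stream.
_PROCESS_START = time.monotonic()


def phase_event(event_type, phase, total):
    """Build one JSON event string (hypothetical helper and field layout)."""
    now = datetime.now(timezone.utc)
    return json.dumps({
        "type": event_type,  # e.g. "phase_start" or "phase_end"
        "phase": phase,
        "total": total,
        # UTC ISO8601 with millisecond precision and a trailing "Z".
        "server_time_utc": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        # Seconds since this process booted.
        "server_uptime_sec": round(time.monotonic() - _PROCESS_START, 1),
    })


example = json.loads(phase_event("phase_start", 1, 15))
```

Using a monotonic clock for uptime keeps the value immune to wall-clock adjustments inside the container.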

@task

… detection

Two changes after platform-behavior verification testing:

1. README / app.py / demo-client.sh: correct the platform-restart story. Earlier docs said 'platform nanny restarts the container within ~5-10 minutes', i.e. autonomously. Empirical observation during testing showed the actual behavior is **ingress-triggered**:

   * Crashed containers stay down with NO ingress.
   * The next inbound request triggers the platform to bring the container back, which happens in ~10 seconds (much faster than the 5-10 min worst-case figure).
   * The durable task then auto-recovers from its last checkpoint.

   Verified by waiting 16 min after crash with zero ingress, then reconnecting and observing the container started 11 sec AFTER my reconnect GET (server logs: 'AgentServerHost started 21:03:18' for reconnect at 21:03:07). User-facing experience is unchanged: any reconnect attempt seamlessly restores the task.

   (The lease-renewal-keeps-sandbox-alive story is also verified: Test B showed phases progressing from uptime 47s -> 569s linearly with no resets during a 9.5-min no-ingress window. The framework's internal lease-renewal cycle ingresses /readiness internally, which keeps the sandbox alive while the @task is executing.)

2. agent.py _wind_down(): change cause detection to use exclusion. ctx.pending_input_count is often back to 0 by the time the wind-down triggers (the framework drained the steering input before we observed). Detect by elimination instead: if neither timeout nor operator_cancel, it must be steering. Removes the bogus 'unknown' cause we saw in steer-test output.

Verified end-to-end against the deployed agent:

Test A (crash recovery, no-ingress):
  20:42:32 dispatched 'carbon capture technology'
  20:46:52 crash (after 4 phases done, uptime 205s)
  20:46:52 + 16 min: NO ingress
  21:03:07 reconnected with GET
  21:03:18 container started (uptime 1.3s), task recovered from phase 5 checkpoint, resumed at phase 6

Test B (lease keeps sandbox alive):
  20:23:59 dispatched 'supply chain resilience'
  20:23:59 + 17 min: NO ingress
  During wait: phases 1-10 all completed; uptime grew 1.9s -> 569s linearly (no restarts during the no-ingress window)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
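The detect-by-elimination rule for the wind-down cause can be sketched as a small pure function. The function name and boolean parameters are hypothetical; the real _wind_down() reads its signals from the task context, which is not shown in this PR text.

```python
def wind_down_cause(timed_out: bool, operator_cancel: bool) -> str:
    """Classify a wind-down by exclusion (hypothetical signature).

    pending_input_count is deliberately not consulted: per the commit
    message it is often already back to 0 when the wind-down runs.
    """
    if timed_out:
        return "timeout"
    if operator_cancel:
        return "operator_cancel"
    # Neither of the other two causes applies, so it must be steering.
    return "steering"
```

The trade-off is that any future fourth cause would be misreported as steering until the function is extended.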

@task

Removed the 'sandbox stays alive while @task executes' claim from the README's headline capabilities. Empirical testing showed this is NOT happening on the current platform deployment.

What we verified:

* Container is reclaimed at exactly the 15-min mark since the last user-facing ingress, regardless of whether a @task handler is actively running.
* Framework's lease-renewal cycle goes to the task-store API (PATCH /api/projects/.../tasks/{id}), NOT to the agent container's /readiness endpoint. So lease renewal doesn't reset the platform's idle timer.
* Crashed/reclaimed containers stay down with zero ingress.
* The next ingress request brings the container back in ~10 sec.
* Durable task auto-resumes from last checkpoint (entry_mode='recovered', correct completed_phases).

README now describes the demo as two capabilities:

1. Crash + idle recovery: any reconnect after a crash or 15-min idle reclaim seamlessly resumes from the last checkpoint
2. Steering: mid-run topic switch via cooperative wind-down

Long-running tasks DO complete (just by being reclaimed-and-recovered repeatedly rather than running uninterrupted on a single container). Section A reframed accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Three issues from review:

1. Session-id lifecycle was unclear. Added a 'Session-id lifecycle' subsection explaining that 'start' allocates a new UUID and writes it to .demo-session; stream/steer/crash/cancel/logs/status all reuse it; 'reset' clears it so the next 'start' allocates a fresh one.
2. Log inspection wasn't documented. Added an 'Inspecting container logs' subsection that points to './demo-client.sh logs' and 'azd ai agent monitor', and enumerates the most useful framework log lines (TaskManager starting, Reclaimed/Recovered task, /readiness probe, OpenAI HTTP requests, Steering drain).
3. Architecture diagram was stale. Removed 'POST /demo/crash' (no such route; the platform only proxies /invocations*), removed the false 'lease renewal pings /readiness' callout, and added a clearer diagram that shows the Foundry control plane separately and calls out the actual mechanisms (lease renewal goes to task-storage API, /readiness is hit only by platform startup probe, container revival is ingress-triggered).

Also added an upfront command-reference table covering every demo-client.sh subcommand, and fixed the env-var doc to reflect that overrides happen via Dockerfile/azure.yaml (the container runs the shipped image, not a local python app.py).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…rver-durable-agent-demo

Picks up the new TaskRun.__await__ method from the core branch (merged in). With this, callers of get_active_run / start can await the returned TaskRun directly to get the TaskResult, removing a pyright squiggle on: run = await deep_research.get_active_run(task_id) No changes to the demo's app.py or agent.py — they already use the correct pattern. This is purely refreshing the bundled wheels so the deployed agent picks up the new core build. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…r imports The demo dir lives under the invocations package which has a pyrightconfig.json that excludes samples/** but still applies its rules to opened files. When the IDE opens app.py / agent.py, it couldn't find the editable-installed agentserver packages without an explicit venvPath / venv setting. Adding a demo-local pyrightconfig.json that: * points venv at the repo's .venv (via the relative path) * suppresses reportMissingImports / reportAttributeAccessIssue (the in-tree editable install is enough; the imports work; we don't need warnings telling us otherwise on a demo) * keeps the meaningful checks (Optional access, argument type, general type issues, return type) Verified: pyright runs clean from the demo dir with this config (0 errors, 1 informational warning on .output Optional access). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

I added a demo-local pyrightconfig.json earlier in this session to work around an IDE squiggle. Root cause was much simpler: the venv just had an OLD wheel (2.0.0b4) cached from way back. Reinstalling the new 2.0.0b6 wheel (which has TaskRun.__await__) in the venv makes everything resolve correctly without any pyright config changes. The IDE was working fine before; this restores that.

Reinstall command:

  pip uninstall -y azure-ai-agentserver-core azure-ai-agentserver-invocations
  pip install sdk/agentserver/azure-ai-agentserver-core \
              sdk/agentserver/azure-ai-agentserver-invocations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…rver-durable-agent-demo

The previous attempt to set FOUNDRY_TASK_API_ENABLED was rejected by the hosting platform (FOUNDRY_*/AGENT_* are reserved namespaces). Core has been updated to use AGENTSERVER_TASK_API_ENABLED instead — apply that here and refresh the bundled wheels. Effect: the demo container now uses HostedTaskProvider, so /tasks HTTP calls (lease renewals, readiness pings, state PATCHes) flow through the TaskApiLoggingPolicy and show up in 'demo-client.sh logs' as 'task-store request: ...' lines. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… validation

Captures the v25 deploy that exercised the lease-renewal + nanny-restore validation:

Test 1: lease keeps sandbox alive >15 min without client ingress: PASS
  Same lease_instance_id for 46+ min, 12 phases completed, only platform /liveness probes and our framework's PATCH .../tasks/<id> lease renewals (every ~30s) kept the sandbox warm.

Test 2: nanny restores crashed sandbox within ~15 min, zero ingress: PASS
  Crashed at 00:02:04Z; new worker came up at 00:02:47Z (43s later); durable task auto-resumed with entry_mode='recovered' from the last checkpoint (completed_phases: 2); progressed through 4 more phases with no client ingress.

agent.yaml is back to default cooldowns (10/20s); only the AGENTSERVER_TASK_API_ENABLED=1 opt-in is retained (committed earlier in 1b1e334).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@task

… validated behavior

Sets hosted-mode INTRA_PHASE_COOLDOWN_SEC=30 and INTER_PHASE_COOLDOWN_SEC=30 in agent.yaml so the deployed durable-research-agent runs for ~33 min (15 phases × (~12s LLM + 3×30s intra + 30s inter)). The run intentionally exceeds the platform's 15-min sandbox-eviction window so each demo run exercises the framework's lease-renewal keep-alive path end-to-end, which is the whole point of @task durability and what we just validated empirically against e2e-tests-westus2.

agent.py defaults (10/20s = ~15 min) are kept for local/dev iteration where the long wall-time isn't useful.

README updates reflect what we proved (rather than what we previously assumed):

- Recovery section now leads with 'long-running tasks survive past 15 min via lease-renewal keep-alive' as a first-class platform capability, not buried in a doubt-laden footnote.
- Removed the 'Note on long-running tasks' disclaimer that claimed lease renewals do NOT extend the idle window. Empirical evidence shows otherwise (Test 1: 46-min uptime, same instance throughout, zero client ingress after T=0).
- Workflow A retitled 'Long-running run with no client-side keepalive' and rewritten to reflect: reconnecting after 25 min finds the SAME instance, not a recovered fresh one.
- Workflow B (crash) reflects the nanny does the restore on its own within ~1 min; no client ingress is required to bring the container back, and the durable task auto-resumes inside the new process.
- Architecture diagram's 'Idle-reclaim timer' note now explains it is kept fresh by framework lease-renewal traffic.
- Env-var table now lists hosted vs agent.py defaults separately and includes AGENTSERVER_TASK_API_ENABLED with explanation.
- Fast-dev-loop block now points at agent.yaml (not the Dockerfile) since env vars live in agent.yaml now.

azd state synced to the v26 deploy that ships these settings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
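The ~33 min and ~15 min figures follow from the per-phase formula quoted above. A small worked calculation, assuming (as the formula does) ~12s of LLM time per phase and three intra-phase cooldowns between four sub-calls; the function name is hypothetical:

```python
def estimated_run_minutes(phases, llm_sec_per_phase, subcalls_per_phase,
                          intra_cooldown_sec, inter_cooldown_sec):
    """Estimate wall time from the commit message's per-phase formula."""
    intra_gaps = subcalls_per_phase - 1  # cooldowns sit between sub-calls
    per_phase_sec = (llm_sec_per_phase
                     + intra_gaps * intra_cooldown_sec
                     + inter_cooldown_sec)
    return phases * per_phase_sec / 60


hosted_minutes = estimated_run_minutes(15, 12, 4, 30, 30)  # agent.yaml: 30/30
local_minutes = estimated_run_minutes(15, 12, 4, 10, 20)   # agent.py: 10/20
```

With the hosted settings each phase costs 132s, so 15 phases take 1980s (33 min); with the agent.py defaults each phase costs 62s, so the run takes 930s (15.5 min).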

…rver-durable-agent-demo

…renderer + add client wall-clock

Core branch now auto-enables HostedTaskProvider in hosted environments, so this demo no longer needs AGENTSERVER_TASK_API_ENABLED. Likewise, wheels are now built centrally via sdk/agentserver/scripts/build-wheels.sh and staged into the docker build context; no committed wheels.

CHANGES

agent.yaml
- Drop AGENTSERVER_TASK_API_ENABLED (auto-on in hosted).
- Tighten the cooldown comment (no behavior change).

build.sh
- Delegate to the central sdk/agentserver/scripts/build-wheels.sh.
- Stage wheels into src/durable-research-agent/wheels/ (gitignored docker-build dir), so the Dockerfile's COPY wheels/ ... still finds them at build time.
- Per-sample build.sh is now a thin staging wrapper; no per-sample duplication of the build logic.

src/durable-research-agent/wheels/*.whl (deleted)
- Wheels are no longer committed. They're regenerated on demand.

app.py: fix file_replay SSE double-encoding
- FileStreamHandler.put writes json.dumps(item)+'\n', where item is itself a JSON string from ctx.stream(json.dumps({...})). The live_stream path correctly reads from the in-memory queue (which holds the original string). The file_replay path read the disk line via json.loads, then RE-WRAPPED with json.dumps before embedding in 'data: ...\n\n', producing data: "{\"type\": \"...\"}" which the client rendered as '[unknown event] "{\"...\"}"'.
- Decode once, embed the raw JSON string directly. Also add an isinstance check before the __done__ key lookup (the decoded value is a string for normal events).
- Update crash-handler 202 response message + docstring to reflect validated behavior (nanny restores ~1 min, no ingress needed).

demo-client.sh
- Add _now_utc() helper and prefix every block-style event with '[HH:MM:SSZ]' (the client's local UTC wall-clock at render time) so users can compare against server_time= (server-side UTC) and uptime= (server process seconds-since-boot) for a clear timeline of phases vs lease renewals vs recoveries.
- Update header comment: drop the wrong '~5-10 min' nanny restore and the wrong 'lease renewal pings readiness' phrasing; reflect the validated 30s lease cadence and ~1 min nanny window.
- Three-terminal usage example: ~33 min (not 45) wall-time per run; nanny restores ~1 min after crash (no need to send any ingress to trigger recovery).
- Crash-command output text: nanny brings container back on its own, no client action required.

README.md
- Capability #1 reframed: lease keep-alive proven end-to-end (e2e-tests-westus2), 33-min runs with zero client ingress.
- Capability #2 reframed: nanny restores within ~1 min (43s measured) without any client ingress; recover-on-reconnect was a misread of the old behavior.
- Deploy section: build.sh now delegates to the central script; points at USING_PRE_RELEASE_WHEELS.md for the wheel workflow.
- Crash command row in the command-reference table: clearer wording around nanny-driven recovery.
- Env-var table: drop AGENTSERVER_TASK_API_ENABLED row (gone); add a paragraph clarifying that hosted/local provider selection is automatic.
- File-structure section: build.sh and wheels/ entries reflect the new layout; add pointer to the wheel-distribution doc.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
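The file_replay double-encoding can be reproduced with the standard json module alone. This sketch mirrors the description in the app.py notes above; the function names are hypothetical, and the shape of the __done__ sentinel (a dict carrying a "__done__" key) is an assumption inferred from the "isinstance check before the __done__ key lookup" wording.

```python
import json

# The handler streams an already-serialised JSON string...
event = json.dumps({"type": "phase_start", "phase": 1})
# ...and the file handler serialises that string again when writing the line.
disk_line = json.dumps(event) + "\n"


def replay_buggy(line):
    item = json.loads(line)               # back to the original JSON string
    return f"data: {json.dumps(item)}\n\n"  # re-wrapped: a quoted, escaped string


def replay_fixed(line):
    item = json.loads(line)               # decode exactly once
    # Assumed sentinel shape: only dicts can carry the __done__ key.
    if isinstance(item, dict) and item.get("__done__"):
        return None
    return f"data: {item}\n\n"            # embed the raw JSON string as-is


buggy_payload = json.loads(replay_buggy(disk_line)[len("data: "):])
fixed_payload = json.loads(replay_fixed(disk_line)[len("data: "):])
```

A client that parses the buggy frame gets a string instead of an object, which matches the '[unknown event]' symptom described above.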

@task

…plify build.sh to copy-only Merges the core branch's three corrections: - Skill moved out of .github/skills/ into sdk/agentserver/docs/ (standalone artifact, devs copy independently). - @task preview wheels checked into sdk/agentserver/wheels/. - USING_PRE_RELEASE_WHEELS.md framing fixed (packages ARE on PyPI; @task primitive is private preview). Demo-specific changes that follow from the above: build.sh - No longer invokes sdk/agentserver/scripts/build-wheels.sh. - Just copies the checked-in central wheels into the per-sample gitignored docker-build staging dir. Faster, no compilation. README.md - Deploy section: 'stage the checked-in @task preview wheels' (not 'build agentserver wheels'). Adds a note that @task is private preview and the wheels are how you get it. - File-structure blurb: matches the new copy-only build.sh. .gitignore - Merged the demo-local Docker-staging entry with the existing .azure / .demo-session entries from this branch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…rver-durable-agent-demo

…eel docs Following the core-branch reorganization that moved sdk/agentserver/docs/USING_PRE_RELEASE_WHEELS.md → sdk/agentserver/wheels/README.md, update the demo's links and a build.sh comment to the new path. No behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PROBLEM

The demo's previous live_stream tracked event_id with a per-invocation counter (event_id starts at 0 on each new GET). Combined with the single-consumer queue contract, a client reconnect with ?last_event_id=N could not deterministically resume: the meaning of event_id N depended on the queue's current state, not the actual emission position.

Concretely observed: with last_event_id=8092 on a long-running task, a reconnect landed at phase 8's mid-content (not the next event after 8092) because (a) prior consumers had dequeued items the new GET could not see, and (b) the new live_stream counted from 1 again, advancing through whatever was currently in the queue.

FIX (smallest possible)

1. FileStreamHandler now tracks a single _next_event_id counter incremented on every disk-line append: preload from disk on __init__, normal put, and the __done__ sentinel in close. Items go onto the queue as (event_id, item) tuples instead of bare items. event_id == disk row number == durable across restarts, recovery, and consumers.
2. app.py live_stream unpacks (event_id, chunk) tuples and uses the durable event_id directly when forming the SSE 'id: N' header. skip_count semantics are now correct: items with event_id <= skip_count are skipped; the rest are emitted with their durable id.
3. Defensive non-tuple unpack path keeps the GET handler safe if the FileStreamHandler is ever swapped for a stock QueueStreamHandler that emits bare items.

ACCEPTED LIMITATION

If a prior consumer has drained items the new GET expected to see, those items are simply not emitted (queue is single-consumer per the framework's StreamHandler contract; there's no way to backfill from disk without a larger refactor). Per user direction: 'one or two delta misses are acceptable; just be graceful.' We achieve that: the new GET emits whatever is currently in the queue and resumes cleanly from there.

SMOKE TEST RESULT (v32 deploy)

- Fresh GET: ids 1..1973 ✓
- Resume last_event_id=1973: starts at 1974, exact continuation ✓
- Resume last_event_id=10 after drain: starts at 2011 (gap skipped gracefully, no error, monotonic forward progress) ✓
- Drain to 2978 then resume from 1489: starts at 2979 (graceful gap skip, ids strictly monotonic) ✓

file_replay path already used disk-line counting, so no change was needed there; live_stream and file_replay now agree on the event_id space.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
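The durable event_id scheme can be sketched with a toy log class. DurableIdLog and this live_stream are hypothetical simplifications, not the demo's FileStreamHandler or app.py; in particular, re-queueing preloaded lines on startup is an assumption made so the example is observable.

```python
import json
import os
import tempfile
from collections import deque


class DurableIdLog:
    """Toy handler: event_id is the 1-based disk row number (assumed layout)."""

    def __init__(self, path):
        self.path = path
        self.queue = deque()
        self._next_event_id = 1
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:  # preload keeps ids stable across restarts
                    self.queue.append((self._next_event_id, json.loads(line)))
                    self._next_event_id += 1

    def put(self, item):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(item) + "\n")
        self.queue.append((self._next_event_id, item))
        self._next_event_id += 1


def live_stream(queue, skip_count):
    """Drain the queue into SSE frames, skipping ids <= skip_count."""
    frames = []
    while queue:
        event_id, chunk = queue.popleft()
        if event_id > skip_count:
            frames.append(f"id: {event_id}\ndata: {chunk}\n\n")
    return frames


path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
first = DurableIdLog(path)
for text in ("a", "b", "c"):
    first.put(text)

restarted = DurableIdLog(path)  # simulates a process restart
restarted.put("d")              # continues at id 4, not 1
frames = live_stream(restarted.queue, skip_count=2)
```

Because the id is derived from the file position rather than from the consumer, a resume request means the same thing no matter how many earlier GETs drained the queue.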

…ithin seconds, not minutes

PROBLEM

User reported: after issuing ./demo-client.sh crash, the SSE stream on the original terminal kept showing events for *minutes* before the disconnect surfaced. This was not a server, proxy, or TCP buffering issue; it was the demo-client renderer itself building a backlog.

ROOT CAUSE

Each rendered event was spawning python3 subprocesses:

* etype detection: 1 python3 per event (~30ms)
* _now_utc(): 1 'date' subprocess per event (~5ms)
* Token content: 1 python3 per token (~30ms)

For the token hot path that meant ~65ms per token. LLMs emit at 50-100 tok/s, so the renderer was running at ~10% of the server's emit rate. The kernel TCP buffer + curl + bash pipe accumulated a backlog that grew ~9 seconds per second of LLM streaming. When the server crashed, that backlog still had to drain through the slow renderer before the EOF on curl reached the bash 'while read' loop.

Measured before:

  100 token renders  = 9.7s
  1000 token renders = 51s
  5000 token renders = timed out at 90s

FIX (minimal, no behavior change)

- etype detection: bash regex on the JSON instead of python3.
- _now_utc(): moved from top-of-render_event into only the cases that actually use it (token + subcall_end don't need wall-clock).
- Token content extraction: bash regex + parameter-expansion unescape for the five common JSON escapes (\\, \", \n, \t, \r). Token literal \uXXXX would print as the raw escape; that's acceptable for a demo.

Measured after:

  5000 token renders = 1.17s (~0.23ms per token, ~220x faster)
  phase_start render = 253ms (still uses _jq; happens 1/3min)

Effect: renderer is now ~50x faster than the LLM emit rate, so no backlog builds. When the server crashes the client sees EOF within its normal poll interval and surfaces the disconnect within seconds.

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…atchdog

PROBLEM

After the previous renderer-speedup, user reported 20-30s latency between issuing a crash and seeing the stream disconnect, even early in phase 1, and that the latency seems to grow with longer streams.

INVESTIGATION

Built a localhost SSE server + bash client loop and measured. The bash renderer is actually fast enough (3300 tok/s drain, 12ms post-EOF detect on a clean close). So the residual latency is NOT in the bash hot path. Two likely causes left:

1. The platform edge proxy between the server container and the client buffers SSE responses and may hold the TCP connection open after the backend dies. There is no client-side way to speed up the EOF in this case.
2. printf-per-token to a real interactive terminal (vs the /dev/null benchmark) has per-call overhead the renderer cannot amortize.

FIX

Replace the bash 'while read | render_event' loop with a single long-lived python renderer. python is fundamentally better-suited for line-rate streaming with batching:

- In-memory token buffer flushed every ~50ms instead of a printf-per-token (~20x fewer terminal syscalls in steady state).
- select() + idle-timer in one loop: tokens batch under load, block events render immediately, and an idle watchdog fires after STALL_SECS of no inbound data.
- When the watchdog fires the renderer SIGTERMs curl (its PID is passed via env var) so the bash pipeline exits within a couple hundred ms of the warning, regardless of whether the platform proxy is still holding the socket open.

The renderer is embedded inline in demo-client.sh as a heredoc (_PY_RENDERER); no separate file. ANSI color codes and event-type formatting match the previous bash implementation exactly.

The bash render_event + _jq helpers are deleted (no longer used). Most of stream_sse is gone too, replaced by a small wrapper that launches curl in the background to capture its PID and feeds its output to python via a FIFO.

KNOBS (env)

  STALL_SECS default 10: stream-idle threshold for the watchdog
  FLUSH_MS   default 50: token-buffer flush cadence

VERIFIED LOCALLY (test harness against a python SSE server)

Happy path: 50-token stream, clean close
- Total wall: 1.04s (matches server emit time)
- STREAM_RESULT=complete, LAST_EVENT_ID propagates correctly

Stall path: 200 tokens, then server hangs (proxy-hang simulation)
- Tokens render smoothly during emission
- 5s after last token the watchdog warns and SIGTERMs curl
- Bash pipeline exits in 9s total (was 24s before the kill-curl fix, would have been 25s+ in production until proxy timed out)

All renderer output (run_start/phase_start/subcall_start/tokens/phase_end/run_complete/done) renders with proper formatting, timestamps, and colors.

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…md_steer) The previous commit (python renderer) deleted render_event + _jq together because both were used by the bash SSE consumer that python replaced. But cmd_start and cmd_steer still call _jq to extract invocation_id / session_id from the one-shot POST response — a small helper, not part of the streaming hot path. Restored the helper with an updated docstring that calls out its narrowed scope. Symptom: 'demo-client.sh: line 367: _jq: command not found' on ./demo-client.sh start, followed by an empty INV_ID. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…lf window

FALSE-POSITIVE OBSERVED

User reported: ./demo-client.sh start emitted research subcall 1/4 then triggered '⚠ stream stalled (no events for 10s)' even though no crash occurred.

Root cause: the hosted agent.yaml sets INTRA_PHASE_COOLDOWN_SEC=30 and INTER_PHASE_COOLDOWN_SEC=30, so there are legitimately ~30s silent periods between subcalls and between phases (asyncio.sleep with no events emitted). A 10s watchdog therefore mis-fires during normal operation.

FIX

1. Default STALL_SECS bumped 10 -> 60, comfortably above the longest planned silence (30s). Crash detection latency goes from 10s to ~60s in exchange for zero false positives during normal runs. Still better than the 20-30s baseline behavior the user saw before any watchdog at all.
2. Added a low-key hint when idle crosses HALF the stall window. Prints '...quiet for Ns (stall threshold 60s)' once every 10s, so the user sees the renderer is alive but quiet during cooldowns instead of wondering if it hung.
3. Hint counter resets every time data arrives, so back-to-back short cooldowns do not pile up hints.

VERIFIED locally

Server: emit run_start, then 40s silence, then run_complete + close
Client: STALL_SECS=60
  [00:00] run_start banner
  [00:30] '...quiet for 30s (stall threshold 60s)'
  [00:40] run_complete renders, STREAM_RESULT=complete

Both knobs remain env-overridable (STALL_SECS, FLUSH_MS).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t SoT

User feedback: 'Why is the watchdog using a time-based idleness as crash? Shouldnt we use the connection closure itself as the SOT?'

They are right. EOF on the curl pipe is the authoritative crash/disconnect signal: TCP close happens when the server (or its upstream proxy) terminates the SSE response. A time-based watchdog duplicates that signal, mis-fires during legitimate quiet periods (this demo has 30s cooldowns between subcalls and phases; see INTRA_PHASE_COOLDOWN_SEC / INTER_PHASE_COOLDOWN_SEC in agent.yaml), and forces every operator to tune cooldown-vs-detection-threshold.

REMOVED

- STALL_SECS env var and all its logic
- The 'half-window quiet hint' (only made sense alongside the watchdog)
- last_data_at and last_idle_hint state
- CURL_PID plumbing (no need to SIGTERM curl when there is no watchdog to force-close it)
- mkfifo / background-curl dance in stream_sse; now a plain pipe

KEPT

- FLUSH_MS token-buffer flush cadence (50ms): still real and useful, it batches terminal writes so the renderer keeps pace with LLM emit rate.
- All ANSI formatting, event-type rendering, event_id passthrough.

EOF flow (the only disconnect path now)

  curl sees TCP close
    -> closes its stdout
    -> python's select() returns ready
    -> os.read returns b''
    -> renderer flush_tokens + break out of while loop
    -> finally writes STATE_FILE
    -> bash sources state
    -> STREAM_RESULT=disconnected (or 'complete' if we saw run_complete / done first)
    -> _report_stream_result prints the right banner.

VERIFIED locally

Happy path (clean close + run_complete): wall=1.05s, STREAM_RESULT=complete ✓
Abrupt close (server emits 50 tokens then closes socket without emitting done): wall=1.04s (matches server timing exactly), STREAM_RESULT=disconnected, no false 'stalled' warning ✓

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
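The EOF flow and the FLUSH_MS batching can be sketched as one select() loop. This is not the renderer embedded in demo-client.sh: the function name, the "content" field on token events and the plain "[type]" rendering are assumptions, and select() on a pipe descriptor works on POSIX systems only.

```python
import json
import os
import select
import time


def render_stream(fd, write, flush_ms=50):
    """Render SSE 'data:' lines from fd; EOF is the only disconnect signal."""
    buf = b""
    tokens = []
    last_flush = time.monotonic()
    saw_complete = False

    def flush_tokens():
        nonlocal last_flush
        if tokens:
            write("".join(tokens))  # one write for the whole batch
            tokens.clear()
        last_flush = time.monotonic()

    while True:
        ready, _, _ = select.select([fd], [], [], flush_ms / 1000)
        if not ready:
            flush_tokens()          # quiet period: flush, never "stall"
            continue
        chunk = os.read(fd, 65536)
        if chunk == b"":            # peer closed the stream
            flush_tokens()
            break
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if not line.startswith(b"data: "):
                continue
            event = json.loads(line[len(b"data: "):])
            if event.get("type") == "token":
                tokens.append(event.get("content", ""))
            else:
                flush_tokens()      # block events render immediately
                write(f"[{event.get('type')}]\n")
                if event.get("type") in ("run_complete", "done"):
                    saw_complete = True
        if (time.monotonic() - last_flush) * 1000 >= flush_ms:
            flush_tokens()
    return "complete" if saw_complete else "disconnected"
```

A silent cooldown of any length simply loops through the timeout branch, which is why no stall threshold needs tuning against the cooldown settings.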

Two user-reported issues, both addressed at the agent layer (no framework changes):

1) The 30s cooldowns between subcalls / phases made the terminal go silent, so it felt like nothing was happening.
2) Phase-level checkpointing meant the user had to wait ~5 min for the first phase to finish before crash testing was meaningful (else recovery just restarted phase 1 from scratch and the demo looked like nothing happened).

CHANGES

agent.py: subcall-level checkpoints

- The handler now persists {in_progress_phase, completed_subcalls, current_text} on top of the prior {completed_phases, results} state. After each LLM subcall returns we flush to ctx.metadata.
- On recovery (ctx.entry_mode == 'recovered'), if we crashed mid-phase we resume that same phase at the next unfinished subcall, reusing the text we had already produced.
- Worst-case work lost on crash drops from ONE FULL PHASE (~3 min + 3 wasted LLM subcalls) to ONE SUBCALL (~30-60s + 1 LLM subcall). Crash testing is now meaningful at any point in the run, not just after a phase boundary.
- Phase-complete checkpoint additionally clears the in-progress fields so the next phase starts cleanly.

agent.py: cooldown events

- New _cooldown(ctx, duration, stage, phase, subcall=, of=) helper that emits a 'cooldown' SSE event before the asyncio sleep:
  {type:cooldown,duration_sec:30,stage:intra_phase, phase:2,total:15,subcall:3,of:4, ...}
- Replaces the bare asyncio.wait_for in both the intra-phase (between subcalls) and inter-phase (between phases) cooldowns.
- The wait stays cancel-aware (steering / operator cancel still short-circuit the cooldown).

demo-client.sh: cooldown renderer

- Added a 'cooldown' case to the python renderer that prints a single dim line, e.g.
  [18:00:42Z] ...cooling down 30s (between subcalls) — next: subcall 3/4 in phase 2/15
- One line per cooldown, no spam.

README: updated the 'what the agent does' blurb to reflect:

- Checkpoints are now per-subcall (not per-phase).
- Cooldowns emit visible SSE events.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
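The subcall-level checkpoint and resume behavior can be illustrated with a small sketch. It is not the agent.py handler: a plain dict stands in for ctx.metadata, run_phases and call_llm are hypothetical names, and the crash_after parameter exists only to simulate a crash. The state keys are the ones named in the commit message.

```python
from typing import Callable, Dict, List, Optional

IN_PROGRESS_KEYS = ("in_progress_phase", "completed_subcalls", "current_text")


def run_phases(state: Dict, total_phases: int, subcalls_per_phase: int,
               call_llm: Callable[[int, int], str],
               crash_after: Optional[int] = None) -> List[str]:
    """Run all phases, checkpointing into `state` after every subcall.

    `state` is a plain dict standing in for ctx.metadata (assumption).
    """
    state.setdefault("completed_phases", 0)
    state.setdefault("results", [])
    calls = 0
    for phase in range(state["completed_phases"] + 1, total_phases + 1):
        if state.get("in_progress_phase") == phase:
            # Recovered mid-phase: skip finished subcalls, reuse their text.
            start = state["completed_subcalls"]
            text = state["current_text"]
        else:
            start, text = 0, ""
        for sub in range(start + 1, subcalls_per_phase + 1):
            if crash_after is not None and calls == crash_after:
                raise RuntimeError("simulated crash")
            text += call_llm(phase, sub)
            calls += 1
            # Subcall-level checkpoint: at most one subcall is lost on crash.
            state.update(in_progress_phase=phase,
                         completed_subcalls=sub,
                         current_text=text)
        # Phase-complete checkpoint, then clear the in-progress fields.
        state["results"].append(text)
        state["completed_phases"] = phase
        for key in IN_PROGRESS_KEYS:
            state.pop(key, None)
    return state["results"]
```

Calling run_phases a second time with the same state dict after a simulated crash resumes at the next unfinished subcall instead of restarting the phase.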

… heredoc)

Symptom (user-reported):

Traceback ... NameError: name 'duration_sec' is not defined

Root cause: my previous commit added the cooldown event renderer with a Python string literal using single quotes:

evt.get('duration_sec', 0)

The single quotes prematurely terminated the surrounding bash heredoc (_PY_RENDERER=apostrophe...apostrophe), so the runtime python source was silently truncated. Bash quote concatenation made it look like a NameError on duration_sec several lines later in the parsed script.

Fix

- Alias the dict key as a module-level constant _DSEC = 'duration_sec' (with double quotes, safe). Use evt.get(_DSEC, ...) at the call site.
- Add a CRITICAL header comment explaining the gotcha so future edits do not reintroduce apostrophes. The header itself is reworded to avoid using the literal character.
- Reword the inline NOTE comment for the same reason.

Verified

- bash -n parses
- python ast.parse on the extracted heredoc parses
- Functional smoke: phase_end and cooldown events render correctly, duration_sec extracts and formats as expected.

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
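One way to guard against this class of bug is a small pre-flight check on the renderer source before it is embedded in the single-quoted bash string. The sketch below is hypothetical and not part of the demo: check_embeddable is an invented helper that flags any apostrophe and also runs ast.parse, mirroring the verification step listed above.

```python
import ast
from typing import List


def check_embeddable(source: str) -> List[str]:
    """Return a list of problems that would break single-quoted embedding."""
    problems = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # An apostrophe anywhere (code, string, or comment) would end the
        # surrounding single-quoted bash string early.
        if "'" in line:
            problems.append(f"line {lineno}: contains an apostrophe")
    try:
        ast.parse(source)
    except SyntaxError as exc:
        problems.append(f"syntax error: {exc.msg}")
    return problems


# The failing form from the commit message versus the aliased form.
BAD = "n = evt.get('duration_sec', 0)\n"
GOOD = '_DSEC = "duration_sec"\nn = {}.get(_DSEC, 0)\n'
```

Here check_embeddable(BAD) reports the apostrophe on line 1, while check_embeddable(GOOD) returns an empty list.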

…ints + cooldown events) Captures the v31 deploy that ships the subcall-level checkpointing and cooldown-event emission from commit 2925f1d. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot added the Hosted Agents sdk/agentserver/* label Jun 2, 2026

RaviPidaparthi changed the base branch from main to feature/agentserver-durable-tasks June 2, 2026 03:33

RaviPidaparthi and others added 11 commits June 2, 2026 03:46

Merge branch 'feature/agentserver-durable-tasks' into feature/agentserver-durable-agent-demo

a9f1ae5

Merge branch 'feature/agentserver-durable-tasks' into feature/agentserver-durable-agent-demo

45f3ec3

Merge branch 'feature/agentserver-durable-tasks' into feature/agentserver-durable-agent-demo

b1b50f1

RaviPidaparthi force-pushed the feature/agentserver-durable-agent-demo branch from 756e0fe to 2553746 Compare June 2, 2026 22:50

RaviPidaparthi and others added 15 commits June 2, 2026 22:52

Merge branch 'feature/agentserver-durable-tasks' into feature/agentserver-durable-agent-demo

169f0e9

Merge branch 'feature/agentserver-durable-tasks' into feature/agentserver-durable-agent-demo

6356ff2

Merge branch 'feature/agentserver-durable-tasks' into feature/agentserver-durable-agent-demo

41e1ef9

RaviPidaparthi and others added 3 commits June 3, 2026 18:23

[agentserver] demo: sync azd env state to v31 deploy (subcall checkpoints + cooldown events)

f46b089

demo(agentserver): TEMPORARY - durable-agent-demo (split out of #46997, never-merged) #47276

RaviPidaparthi wants to merge 30 commits into feature/agentserver-durable-tasks from feature/agentserver-durable-agent-demo

RaviPidaparthi commented Jun 2, 2026
