demo(agentserver): TEMPORARY - durable-agent-demo (split out of #46997, never-merged) #47276
Draft
RaviPidaparthi wants to merge 30 commits into
This commit restores the azd-deployable durable-agent-demo (34 files)
that was moved out of the core PR (#46997) to keep scope manageable.
Sits on top of the core PR branch so it only shows the demo delta.

🚨 TEMPORARY: this PR is NOT intended for merge. The demo lives here
purely so it isn't lost from the working set; we use it as a reference
deployment while the durable-task primitive matures. The distilled
invocations sample (samples/durable_research) derived from this demo
ships in PR #46997 instead.

Restored from safety-spec016-backup-2026-06-02 (SHA 3df9c5b).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rver-durable-agent-demo
…rver-durable-agent-demo
…16 core
Three call-site updates to align the demo with the spec 016 public surface:
1. Drop TaskTerminated from imports (it was removed from the public
surface — TaskCancelled now covers cooperative-cancel paths).
2. Drop session_id= from deep_research.start() — session is platform-
derived from FOUNDRY_AGENT_SESSION_ID, not a per-call argument.
3. Await deep_research.get_active_run(task_id) — it's now an async
method (the framework needs to consult the task store, not just
in-memory state) so the previous synchronous call returned a
coroutine, not a TaskRun.
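The failure mode in item 3 can be reproduced without the framework. The sketch below uses hypothetical stand-ins (FakeDeepResearch and FakeTaskRun are not the agentserver API) and only illustrates why a call site that omits await on an async method ends up holding a coroutine object:

```python
import asyncio
import inspect


class FakeTaskRun:
    """Hypothetical stand-in for the framework's TaskRun."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id


class FakeDeepResearch:
    """Hypothetical stand-in for the demo's deep_research task object."""

    async def get_active_run(self, task_id: str) -> FakeTaskRun:
        # The real method consults the task store; sleep(0) models that I/O hop.
        await asyncio.sleep(0)
        return FakeTaskRun(task_id)


async def main() -> tuple:
    deep_research = FakeDeepResearch()

    # Old call site: no await, so this is a coroutine object, not a TaskRun.
    pending = deep_research.get_active_run("task-1")
    was_coroutine = inspect.iscoroutine(pending)
    pending.close()  # suppress the "never awaited" RuntimeWarning

    # Fixed call site: awaiting yields the TaskRun itself.
    run = await deep_research.get_active_run("task-1")
    return was_coroutine, run.task_id


result = asyncio.run(main())
```

Any attribute access on the un-awaited value (for example .task_id) would have failed, which is presumably how the stale call site surfaced.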
Also refreshes the bundled wheels (b4 -> core 2.0.0b6 + invocations
1.0.0b5) and the azd env state from a fresh 'azd up' deployment
against the e2e-tests-westus2 Foundry project.
Verified end-to-end against the deployed agent:
./demo-client.sh start "durable tasks demonstration"
-> streams stages 1/12, 2/12, ... live via SSE
./demo-client.sh crash
-> {"status":"crashing"}; supervisor restarts the container
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…story

The platform now provides two capabilities that obsolete our
application-level infrastructure:
* nanny worker restarts the container within ~5-10 min of a crash
* lease-renewal on @task internally pings /readiness so the sandbox
  stays alive as long as a durable task is executing (no need for
  client traffic)

This commit rewrites the durable-research-agent demo around those
guarantees and adds steering as a third headline capability.

Removed:
* supervisor.py (170 lines): the PID-1 reverse proxy + restart loop
* entrypoint.sh: the auto-restart bash wrapper
* aiohttp production dep (only supervisor used it)
* Dockerfile FOUNDRY_TASK_API_ENABLED (auto-selected now per dev guide)
* Multi-shape input parsing (per feedback: it's a demo; stick to one
  shape)

Container now runs 'python app.py' directly; CMD changed accordingly
and azure.yaml startupCommand updated to match.

agent.py rewrite (~221 lines):
* 15 phases x 4 LLM sub-calls (research -> critique -> refine ->
  synthesize) targeting ~45 min total wall time (gpt-4.1-mini, 1500
  output tokens/call). Env-overridable: NUM_PHASES, CALLS_PER_PHASE,
  TARGET_OUTPUT_TOKENS, INTRA_PHASE_COOLDOWN_SEC,
  INTER_PHASE_COOLDOWN_SEC.
* Every phase emits phase_start + phase_end events with
  server_time_utc (UTC ISO8601 with ms) and server_uptime_sec. The
  uptime resets to ~0 after the platform nanny restarts the container,
  so a viewer can SEE the crash recovery happen in the stream.
* @task(steerable=True). On every checkpoint boundary the handler
  checks ctx.cancel.is_set() and (when pending_input_count > 0) emits
  a winding_down event with cause + returns ctx.suspend(). The
  framework drains the next steering input as a fresh turn.
* Topic-change detection at handler entry resets checkpoint state when
  the steered topic differs from the previously stored one.

app.py rewrite:
* task_id = session_id (was invocation_id) so steering routes
  correctly: second POST on the same session hits the active task and
  queues input.
* POST /invocations with {"message": "crash"} (DEMO_MODE=1) exits the
  process so the platform nanny restarts the container. The platform
  only proxies /invocations*, so we can't add custom routes.
* GET /invocations/{id} falls back to file replay when no live run is
  present, so reconnecting after the task completes still shows the
  full transcript (regression fix per review feedback).
* session_id read from app.config.session_id when not on request.state
  (GET state doesn't carry it).

demo-client.sh rewrite:
* Pretty SSE renderer that recognises the new event types (run_start,
  recovered, phase_start/end, subcall_start/end, winding_down,
  run_complete) and box-prints the timestamps.
* Commands: start, stream (reconnect), steer, crash, cancel, status,
  logs, reset. No more auto-reconnect spam: disconnects suggest manual
  reconnect, matching the long-run / no-ingress demo flow.
* Three-terminal demo workflow documented in --help and README.

README rewrite:
* Documents the three capabilities (long-running > 15 min, crash
  recovery via platform nanny ~5-10 min, steering).
* New A/B/C demo walkthroughs (long-run no-ingress, crash-recovery,
  steering).
* Architecture diagram drops supervisor; lists the platform-managed
  behaviors (nanny, lease renewal) explicitly.
* Env var table reflects new tuning knobs.

Verified end-to-end against the deployed agent (e2e-tests-westus2 /
durable-research-agent):
* Streaming with timestamps -> all event types render correctly
* Steering -> 'Steering drain: task drained next input' in server
  logs; new turn runs new topic
* GET after completion -> HTTP 200 + SSE file replay
* Crash dispatch -> POST returns 202; next POST gets 424 (container
  down, awaiting nanny restart)

Refreshed wheels (core 2.0.0b6 + invocations 1.0.0b5) and azd env
state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… detection
Two changes after platform-behavior verification testing:
1. README / app.py / demo-client.sh: correct the platform-restart story.
Earlier docs said 'platform nanny restarts the container within
~5-10 minutes' — autonomous. Empirical observation during testing
showed the actual behavior is **ingress-triggered**:
* Crashed containers stay down with NO ingress.
* The next inbound request triggers the platform to bring the
container back, which happens in ~10 seconds (much faster than
the 5-10 min worst-case figure).
* The durable task then auto-recovers from its last checkpoint.
Verified by waiting 16 min after crash with zero ingress, then
reconnecting and observing the container started 11 sec AFTER my
reconnect GET (server logs: 'AgentServerHost started 21:03:18' for
reconnect at 21:03:07). User-facing experience is unchanged: any
reconnect attempt seamlessly restores the task.
(The lease-renewal-keeps-sandbox-alive story is also verified —
Test B showed phases progressing from uptime 47s -> 569s linearly
with no resets during a 9.5-min no-ingress window. The framework's
internal lease-renewal cycle ingresses /readiness internally, which
keeps the sandbox alive while the @task is executing.)
2. agent.py _wind_down(): change cause detection to use exclusion.
ctx.pending_input_count is often back to 0 by the time the wind-down
triggers (the framework drained the steering input before we
observed). Detect by elimination instead: if neither timeout nor
operator_cancel, it must be steering. Removes the bogus 'unknown'
cause we saw in steer-test output.
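The detect-by-elimination rule can be sketched as a small pure function. The function name and its boolean parameters are hypothetical (the real _wind_down() reads its signals from the task context); only the cause labels come from the text above:

```python
def wind_down_cause(timed_out: bool, operator_cancelled: bool) -> str:
    """Classify why a wind-down fired, by elimination.

    pending_input_count is deliberately not consulted: it may already be
    back to 0 by the time the handler looks at it.
    """
    if timed_out:
        return "timeout"
    if operator_cancelled:
        return "operator_cancel"
    # Neither explicit signal is set, so the remaining trigger is steering.
    return "steering"
```

Because the final branch is a catch-all, the function can no longer return 'unknown'; the trade-off is that any future fourth trigger would be misreported as steering until the function is extended.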
Verified end-to-end against the deployed agent:
Test A (crash recovery, no-ingress):
20:42:32 dispatched 'carbon capture technology'
20:46:52 crash (after 4 phases done, uptime 205s)
20:46:52 + 16 min: NO ingress
21:03:07 reconnected with GET
21:03:18 container started (uptime 1.3s), task recovered from
phase 5 checkpoint, resumed at phase 6
Test B (lease keeps sandbox alive):
20:23:59 dispatched 'supply chain resilience'
20:23:59 + 17 min: NO ingress
During wait: phases 1-10 all completed; uptime grew 1.9s -> 569s
linearly (no restarts during the no-ingress window)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removed the 'sandbox stays alive while @task executes' claim from the
README's headline capabilities: empirical testing showed this is NOT
happening on the current platform deployment.

What we verified:
* Container is reclaimed at exactly the 15-min mark since the last
  user-facing ingress, regardless of whether a @task handler is
  actively running.
* Framework's lease-renewal cycle goes to the task-store API (PATCH
  /api/projects/.../tasks/{id}), NOT to the agent container's
  /readiness endpoint. So lease renewal doesn't reset the platform's
  idle timer.
* Crashed/reclaimed containers stay down with zero ingress.
* The next ingress request brings the container back in ~10 sec.
* Durable task auto-resumes from last checkpoint
  (entry_mode='recovered', correct completed_phases).

README now describes the demo as two capabilities:
1. Crash + idle recovery: any reconnect after a crash or 15-min idle
   reclaim seamlessly resumes from the last checkpoint
2. Steering: mid-run topic switch via cooperative wind-down

Long-running tasks DO complete (just by being reclaimed-and-recovered
repeatedly rather than running uninterrupted on a single container).
Section A reframed accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three issues from review:

1. Session-id lifecycle was unclear. Added a 'Session-id lifecycle'
   subsection explaining that 'start' allocates a new UUID and writes
   it to .demo-session; stream/steer/crash/cancel/logs/status all
   reuse it; 'reset' clears it so the next 'start' allocates a fresh
   one.
2. Log inspection wasn't documented. Added an 'Inspecting container
   logs' subsection that points to './demo-client.sh logs' and 'azd ai
   agent monitor', and enumerates the most useful framework log lines
   (TaskManager starting, Reclaimed/Recovered task, /readiness probe,
   OpenAI HTTP requests, Steering drain).
3. Architecture diagram was stale. Removed 'POST /demo/crash' (no such
   route: platform only proxies /invocations*), removed the false
   'lease renewal pings /readiness' callout, and added a clearer
   diagram that shows the Foundry control plane separately and calls
   out the actual mechanisms (lease renewal goes to task-storage API,
   /readiness is hit only by platform startup probe, container revival
   is ingress-triggered).

Also added an upfront command-reference table covering every
demo-client.sh subcommand, and fixed the env-var doc to reflect that
overrides happen via Dockerfile/azure.yaml (the container runs the
shipped image, not a local python app.py).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rver-durable-agent-demo
Picks up the new TaskRun.__await__ method from the core branch
(merged in). With this, callers of get_active_run / start can await
the returned TaskRun directly to get the TaskResult, removing a
pyright squiggle on:
run = await deep_research.get_active_run(task_id)
No changes to the demo's app.py or agent.py — they already use the
correct pattern. This is purely refreshing the bundled wheels so the
deployed agent picks up the new core build.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r imports
The demo dir lives under the invocations package which has a
pyrightconfig.json that excludes samples/** but still applies its
rules to opened files. When the IDE opens app.py / agent.py, it
couldn't find the editable-installed agentserver packages without an
explicit venvPath / venv setting.
Adding a demo-local pyrightconfig.json that:
* points venv at the repo's .venv (via the relative path)
* suppresses reportMissingImports / reportAttributeAccessIssue
(the in-tree editable install is enough; the imports work; we
don't need warnings telling us otherwise on a demo)
* keeps the meaningful checks (Optional access, argument type,
general type issues, return type)
Verified: pyright runs clean from the demo dir with this config
(0 errors, 1 informational warning on .output Optional access).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
I added a demo-local pyrightconfig.json earlier in this session to
work around an IDE squiggle. Root cause was much simpler: the venv
just had an OLD wheel (2.0.0b4) cached from way back. Reinstalling
the new 2.0.0b6 wheel (which has TaskRun.__await__) in the venv
makes everything resolve correctly without any pyright config
changes — the IDE was working fine before; this restores that.
Reinstall command:
pip uninstall -y azure-ai-agentserver-core azure-ai-agentserver-invocations
pip install sdk/agentserver/azure-ai-agentserver-core \
sdk/agentserver/azure-ai-agentserver-invocations
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
756e0fe to 2553746
…rver-durable-agent-demo
The previous attempt to set FOUNDRY_TASK_API_ENABLED was rejected by
the hosting platform (FOUNDRY_*/AGENT_* are reserved namespaces). Core
has been updated to use AGENTSERVER_TASK_API_ENABLED instead; apply
that here and refresh the bundled wheels.

Effect: the demo container now uses HostedTaskProvider, so /tasks HTTP
calls (lease renewals, readiness pings, state PATCHes) flow through
the TaskApiLoggingPolicy and show up in 'demo-client.sh logs' as
'task-store request: ...' lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… validation
Captures the v25 deploy that exercised the lease-renewal + nanny-restore
validation:
Test 1 — lease keeps sandbox alive >15 min without client ingress: PASS
Same lease_instance_id for 46+ min, 12 phases completed, only platform
/liveness probes and our framework's PATCH .../tasks/<id> lease
renewals (every ~30s) kept the sandbox warm.
Test 2 — nanny restores crashed sandbox within ~15 min, zero ingress: PASS
Crashed at 00:02:04Z; new worker came up at 00:02:47Z (43s later);
durable task auto-resumed with entry_mode='recovered' from the last
checkpoint (completed_phases: 2); progressed through 4 more phases
with no client ingress.
agent.yaml is back to default cooldowns (10/20s); only the
AGENTSERVER_TASK_API_ENABLED=1 opt-in is retained (committed earlier
in 1b1e334).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… validated behavior

Sets hosted-mode INTRA_PHASE_COOLDOWN_SEC=30 and
INTER_PHASE_COOLDOWN_SEC=30 in agent.yaml so the deployed
durable-research-agent runs for ~33 min (15 phases × (~12s LLM + 3×30s
intra + 30s inter)). The run intentionally exceeds the platform's
15-min sandbox-eviction window so each demo run exercises the
framework's lease-renewal keep-alive path end-to-end, which is the
whole point of @task durability and what we just validated empirically
against e2e-tests-westus2.

agent.py defaults (10/20s = ~15 min) are kept for local/dev iteration
where the long wall-time isn't useful.

README updates reflect what we proved (rather than what we previously
assumed):
- Recovery section now leads with 'long-running tasks survive past 15
  min via lease-renewal keep-alive' as a first-class platform
  capability, not buried in a doubt-laden footnote.
- Removed the 'Note on long-running tasks' disclaimer that claimed
  lease renewals do NOT extend the idle window; empirical evidence
  shows otherwise (Test 1: 46-min uptime, same instance throughout,
  zero client ingress after T=0).
- Workflow A retitled 'Long-running run with no client-side keepalive'
  and rewritten to reflect: reconnecting after 25 min finds the SAME
  instance, not a recovered fresh one.
- Workflow B (crash) reflects the nanny does the restore on its own
  within ~1 min; no client ingress is required to bring the container
  back, and the durable task auto-resumes inside the new process.
- Architecture diagram's 'Idle-reclaim timer' note now explains it is
  kept fresh by framework lease-renewal traffic.
- Env-var table now lists hosted vs agent.py defaults separately and
  includes AGENTSERVER_TASK_API_ENABLED with explanation.
- Fast-dev-loop block now points at agent.yaml (not the Dockerfile)
  since env vars live in agent.yaml now.

azd state synced to the v26 deploy that ships these settings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
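The wall-time figures quoted in this commit follow from simple arithmetic. The helper below is only a worked check of that formula; the function name is hypothetical, and the ~12s value is taken as the total LLM time per phase, as the commit's own formula does:

```python
def estimated_wall_time_sec(
    num_phases: int,
    calls_per_phase: int,
    llm_sec_per_phase: float,
    intra_cooldown_sec: float,
    inter_cooldown_sec: float,
) -> float:
    """Estimate total run time from the phase/cooldown tuning knobs."""
    # N sub-calls in a phase are separated by N-1 intra-phase cooldowns.
    intra_gaps = calls_per_phase - 1
    per_phase = (
        llm_sec_per_phase
        + intra_gaps * intra_cooldown_sec
        + inter_cooldown_sec
    )
    return num_phases * per_phase


# Hosted agent.yaml settings: 30s intra / 30s inter.
hosted_min = estimated_wall_time_sec(15, 4, 12, 30, 30) / 60

# agent.py defaults: 10s intra / 20s inter.
local_min = estimated_wall_time_sec(15, 4, 12, 10, 20) / 60
```

With these inputs the hosted run comes out at 33 minutes and the local default at 15.5 minutes, matching the '~33 min' and '~15 min' figures in the commit message.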
…rver-durable-agent-demo
…renderer + add client wall-clock
Core branch now auto-enables HostedTaskProvider in hosted environments,
so this demo no longer needs AGENTSERVER_TASK_API_ENABLED. Likewise,
wheels are now built centrally via sdk/agentserver/scripts/build-wheels.sh
and staged into the docker build context — no committed wheels.
CHANGES
agent.yaml
- Drop AGENTSERVER_TASK_API_ENABLED (auto-on in hosted).
- Tighten the cooldown comment (no behavior change).
build.sh
- Delegate to the central sdk/agentserver/scripts/build-wheels.sh.
- Stage wheels into src/durable-research-agent/wheels/ (gitignored
docker-build dir), so the Dockerfile's COPY wheels/ ... still
finds them at build time.
- Per-sample build.sh is now a thin staging wrapper; no per-sample
duplication of the build logic.
src/durable-research-agent/wheels/*.whl (deleted)
- Wheels are no longer committed. They're regenerated on demand.
app.py — fix file_replay SSE double-encoding
- FileStreamHandler.put writes json.dumps(item)+'\n', where item
is itself a JSON string from ctx.stream(json.dumps({...})). The
live_stream path correctly reads from the in-memory queue (which
holds the original string). The file_replay path read the disk
line via json.loads, then RE-WRAPPED with json.dumps before
embedding in 'data: ...\n\n' — producing
data: "{\"type\": \"...\"}"
which the client rendered as '[unknown event] "{\"...\"}"'.
- Decode once, embed the raw JSON string directly. Also add an
isinstance check before the __done__ key lookup (the decoded
value is a string for normal events).
- Update crash-handler 202 response message + docstring to reflect
validated behavior (nanny restores ~1 min, no ingress needed).
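The double-encoding bug is easy to reproduce with the json module alone. In this sketch the event payload and the dict-shaped __done__ sentinel are assumptions for illustration; the encode/decode sequence mirrors the description above:

```python
import json

# What the agent passes to ctx.stream(): already a JSON string.
event = json.dumps({"type": "phase_start", "phase": 1})

# What the file handler writes: the string is JSON-encoded a second time.
disk_line = json.dumps(event) + "\n"

# Replay path: one json.loads recovers the original JSON string.
decoded = json.loads(disk_line)

# Buggy replay: re-wrapping produces a quoted, backslash-escaped payload.
buggy_frame = f"data: {json.dumps(decoded)}\n\n"

# Fixed replay: embed the decoded string as-is.
fixed_frame = f"data: {decoded}\n\n"


def is_done_sentinel(value: object) -> bool:
    """Guard the __done__ lookup: normal events decode to str, not dict."""
    # Assumption: the sentinel is stored as a dict carrying a __done__ key.
    return isinstance(value, dict) and "__done__" in value
```

A client that parses the payload of buggy_frame gets a string back rather than an object, which is why the renderer fell through to its '[unknown event]' branch.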
demo-client.sh
- Add _now_utc() helper and prefix every block-style event with
'[HH:MM:SSZ]' — the client's local UTC wall-clock at render
time — so users can compare against server_time= (server-side
UTC) and uptime= (server process seconds-since-boot) for a
clear timeline of phases vs lease renewals vs recoveries.
- Update header comment: drop the wrong '~5-10 min' nanny restore
and the wrong 'lease renewal pings readiness' phrasing; reflect
the validated 30s lease cadence and ~1 min nanny window.
- Three-terminal usage example: ~33 min (not 45) wall-time per
run; nanny restores ~1 min after crash (no need to send any
ingress to trigger recovery).
- Crash-command output text: nanny brings container back on its
own, no client action required.
README.md
- Capability #1 reframed: lease keep-alive proven end-to-end
(e2e-tests-westus2), 33-min runs with zero client ingress.
- Capability #2 reframed: nanny restores within ~1 min (43s
measured) without any client ingress; recover-on-reconnect was
a misread of the old behavior.
- Deploy section: build.sh now delegates to the central script;
points at USING_PRE_RELEASE_WHEELS.md for the wheel workflow.
- Crash command row in the command-reference table: clearer wording
around nanny-driven recovery.
- Env-var table: drop AGENTSERVER_TASK_API_ENABLED row (gone);
add a paragraph clarifying that hosted/local provider selection
is automatic.
- File-structure section: build.sh and wheels/ entries reflect the
new layout; add pointer to the wheel-distribution doc.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…plify build.sh to copy-only
Merges the core branch's three corrections:
- Skill moved out of .github/skills/ into sdk/agentserver/docs/
(standalone artifact, devs copy independently).
- @task preview wheels checked into sdk/agentserver/wheels/.
- USING_PRE_RELEASE_WHEELS.md framing fixed (packages ARE on PyPI;
@task primitive is private preview).
Demo-specific changes that follow from the above:
build.sh
- No longer invokes sdk/agentserver/scripts/build-wheels.sh.
- Just copies the checked-in central wheels into the per-sample
gitignored docker-build staging dir. Faster, no compilation.
README.md
- Deploy section: 'stage the checked-in @task preview wheels' (not
'build agentserver wheels'). Adds a note that @task is private
preview and the wheels are how you get it.
- File-structure blurb: matches the new copy-only build.sh.
.gitignore
- Merged the demo-local Docker-staging entry with the existing
.azure / .demo-session entries from this branch.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rver-durable-agent-demo
…eel docs

Following the core-branch reorganization that moved
sdk/agentserver/docs/USING_PRE_RELEASE_WHEELS.md →
sdk/agentserver/wheels/README.md, update the demo's links and a
build.sh comment to the new path. No behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PROBLEM
The demo's previous live_stream tracked event_id with a per-invocation
counter (event_id starts at 0 on each new GET). Combined with the
single-consumer queue contract, a client reconnect with
?last_event_id=N could not deterministically resume: the meaning of
event_id N depended on the queue's current state, not the actual
emission position.
Concretely observed: with last_event_id=8092 on a long-running task, a
reconnect landed at phase 8's mid-content (not the next event after
8092) because (a) prior consumers had dequeued items the new GET could
not see, and (b) the new live_stream counted from 1 again, advancing
through whatever was currently in the queue.
FIX (smallest possible)
1. FileStreamHandler now tracks a single _next_event_id counter
   incremented on every disk-line append: preload from disk on
   __init__, normal put, and the __done__ sentinel in close. Items go
   onto the queue as (event_id, item) tuples instead of bare items.
   event_id == disk row number == durable across restarts, recovery,
   and consumers.
2. app.py live_stream unpacks (event_id, chunk) tuples and uses the
   durable event_id directly when forming the SSE 'id: N' header.
   skip_count semantics are now correct: items with event_id <=
   skip_count are skipped; the rest are emitted with their durable id.
3. Defensive non-tuple unpack path keeps the GET handler safe if the
   FileStreamHandler is ever swapped for a stock QueueStreamHandler
   that emits bare items.
ACCEPTED LIMITATION
If a prior consumer has drained items the new GET expected to see,
those items are simply not emitted (queue is single-consumer per the
framework's StreamHandler contract, so there's no way to backfill from
disk without a larger refactor). Per user direction: 'one or two delta
misses are acceptable; just be graceful.' We achieve that: the new GET
emits whatever is currently in the queue and resumes cleanly from
there.
SMOKE TEST RESULT (v32 deploy)
- Fresh GET: ids 1..1973 ✓
- Resume last_event_id=1973: starts at 1974, exact continuation ✓
- Resume last_event_id=10 after drain: starts at 2011 (gap skipped
  gracefully, no error, monotonic forward progress) ✓
- Drain to 2978 then resume from 1489: starts at 2979 (graceful gap
  skip, ids strictly monotonic) ✓
file_replay path already used disk-line counting, so no change was
needed there; live_stream and file_replay now agree on the event_id
space.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
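The durable-id scheme can be sketched in a few lines. DurableIdLog is a hypothetical stand-in, not the demo's FileStreamHandler: an in-memory list replaces the on-disk file, and preloaded rows are assumed to advance the counter without being re-queued.

```python
import queue


class DurableIdLog:
    """Assigns each appended row an id equal to its 1-based row number."""

    def __init__(self, existing_rows):
        # Preload: rows already on "disk" consume ids 1..len(existing_rows).
        self.rows = list(existing_rows)
        self._next_event_id = len(self.rows) + 1
        self.queue = queue.Queue()

    def put(self, item: str) -> int:
        event_id = self._next_event_id
        self._next_event_id += 1
        self.rows.append(item)
        # The id travels with the item, so any consumer sees the same number.
        self.queue.put((event_id, item))
        return event_id


def live_frames(q: "queue.Queue", skip_count: int) -> list:
    """Drain the queue into SSE frames, skipping ids <= skip_count."""
    frames = []
    while True:
        try:
            event_id, chunk = q.get_nowait()
        except queue.Empty:
            break
        if event_id <= skip_count:
            continue  # the client already has this event
        frames.append(f"id: {event_id}\ndata: {chunk}\n\n")
    return frames


log = DurableIdLog(["row-1", "row-2"])  # simulates a restart with 2 rows on disk
first_id = log.put("row-3")
second_id = log.put("row-4")
frames = live_frames(log.queue, skip_count=3)  # client sent last_event_id=3
```

Because the id is fixed when the row is appended rather than when it is read, a reconnecting consumer never renumbers events; at worst it sees a gap, which is the accepted limitation described above.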
…ithin seconds, not minutes

PROBLEM
User reported: after issuing ./demo-client.sh crash, the SSE stream on
the original terminal kept showing events for *minutes* before the
disconnect surfaced. This was not a server, proxy, or TCP buffering
issue; it was the demo-client renderer itself building a backlog.

ROOT CAUSE
Each rendered event was spawning python3 subprocesses:
* etype detection: 1 python3 per event (~30ms)
* _now_utc(): 1 'date' subprocess per event (~5ms)
* Token content: 1 python3 per token (~30ms)
For the token hot path that meant ~65ms per token. LLMs emit at 50-100
tok/s, so the renderer was running at ~10% of the server's emit rate.
The kernel TCP buffer + curl + bash pipe accumulated a backlog that
grew ~9 seconds per second of LLM streaming. When the server crashed,
that backlog still had to drain through the slow renderer before the
EOF on curl reached the bash 'while read' loop.

Measured before:
100 token renders = 9.7s
1000 token renders = 51s
5000 token renders = timed out at 90s

FIX (minimal, no behavior change)
- etype detection: bash regex on the JSON instead of python3.
- _now_utc(): moved from top-of-render_event into only the cases that
  actually use it (token + subcall_end don't need wall-clock).
- Token content extraction: bash regex + parameter-expansion unescape
  for the five common JSON escapes (\\, \", \n, \t, \r). A literal
  \uXXXX in a token would print as the raw escape; that's acceptable
  for a demo.

Measured after:
5000 token renders = 1.17s (~0.23ms per token, ~220x faster)
phase_start render = 253ms (still uses _jq; happens 1/3min)

Effect: renderer is now ~50x faster than the LLM emit rate, so no
backlog builds. When the server crashes the client sees EOF within its
normal poll interval and surfaces the disconnect within seconds.

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…atchdog
PROBLEM
After the previous renderer-speedup, user reported 20-30s latency
between issuing a crash and seeing the stream disconnect, even early
in phase 1 — and that the latency seems to grow with longer streams.
INVESTIGATION
Built a localhost SSE server + bash client loop and measured. The
bash renderer is actually fast enough (3300 tok/s drain, 12ms
post-EOF detect on a clean close). So the residual latency is NOT in
the bash hot path. Two likely causes left:
1. The platform edge proxy between the server container and the
client buffers SSE responses and may hold the TCP connection
open after the backend dies — there is no client-side way to
speed up the EOF in this case.
2. printf-per-token to a real interactive terminal (vs the
/dev/null benchmark) has per-call overhead the renderer cannot
amortize.
FIX
Replace the bash 'while read | render_event' loop with a single
long-lived python renderer. python is fundamentally better-suited
for line-rate streaming with batching:
- In-memory token buffer flushed every ~50ms instead of a
printf-per-token (~20x fewer terminal syscalls in steady state).
- select() + idle-timer in one loop: tokens batch under load,
block events render immediately, and an idle watchdog fires
after STALL_SECS of no inbound data.
- When the watchdog fires the renderer SIGTERMs curl (its PID is
passed via env var) so the bash pipeline exits within a couple
hundred ms of the warning, regardless of whether the platform
proxy is still holding the socket open.
The renderer is embedded inline in demo-client.sh as a heredoc
(_PY_RENDERER); no separate file. ANSI color codes and event-type
formatting match the previous bash implementation exactly.
The bash render_event + _jq helpers are deleted (no longer used).
Most of stream_sse is gone too — replaced by a small wrapper that
launches curl in the background to capture its PID and feeds its
output to python via a FIFO.
KNOBS (env)
STALL_SECS default 10 — stream-idle threshold for the watchdog
FLUSH_MS default 50 — token-buffer flush cadence
VERIFIED LOCALLY (test harness against a python SSE server)
Happy path: 50-token stream, clean close
- Total wall: 1.04s (matches server emit time)
- STREAM_RESULT=complete, LAST_EVENT_ID propagates correctly
Stall path: 200 tokens, then server hangs (proxy-hang simulation)
- Tokens render smoothly during emission
- 5s after last token the watchdog warns and SIGTERMs curl
- Bash pipeline exits in 9s total (was 24s before the kill-curl
fix, would have been 25s+ in production until proxy timed out)
All renderer output (run_start/phase_start/subcall_start/tokens/
phase_end/run_complete/done) renders with proper formatting,
timestamps, and colors.
No server-side change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…md_steer)

The previous commit (python renderer) deleted render_event + _jq
together because both were used by the bash SSE consumer that python
replaced. But cmd_start and cmd_steer still call _jq to extract
invocation_id / session_id from the one-shot POST response: a small
helper, not part of the streaming hot path. Restored the helper with
an updated docstring that calls out its narrowed scope.

Symptom: 'demo-client.sh: line 367: _jq: command not found' on
./demo-client.sh start, followed by an empty INV_ID.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lf window
FALSE-POSITIVE OBSERVED
User reported: ./demo-client.sh start emitted research subcall 1/4
then triggered '⚠ stream stalled (no events for 10s)' even though no
crash occurred. Root cause: the hosted agent.yaml sets
INTRA_PHASE_COOLDOWN_SEC=30 and INTER_PHASE_COOLDOWN_SEC=30, so there
are legitimately ~30s silent periods between subcalls and between
phases (asyncio.sleep with no events emitted). A 10s watchdog
therefore mis-fires during normal operation.
FIX
1. Default STALL_SECS bumped 10 -> 60, comfortably above the longest
planned silence (30s). Crash detection latency goes from 10s to
~60s in exchange for zero false positives during normal runs.
Still better than the 20-30s baseline behavior the user saw before
any watchdog at all.
2. Added a low-key hint when idle crosses HALF the stall window.
Prints '...quiet for Ns (stall threshold 60s)' once every 10s,
so the user sees the renderer is alive but quiet during cooldowns
instead of wondering if it hung.
3. Hint counter resets every time data arrives, so back-to-back
short cooldowns do not pile up hints.
VERIFIED locally
Server: emit run_start, then 40s silence, then run_complete + close
Client: STALL_SECS=60
[00:00] run_start banner
[00:30] '...quiet for 30s (stall threshold 60s)'
[00:40] run_complete renders, STREAM_RESULT=complete
Both knobs remain env-overridable (STALL_SECS, FLUSH_MS).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t SoT
User feedback: 'Why is the watchdog using a time-based idleness as
crash? Shouldnt we use the connection closure itself as the SOT?'
They are right. EOF on the curl pipe is the authoritative
crash/disconnect signal — TCP close happens when the server (or its
upstream proxy) terminates the SSE response. A time-based watchdog
duplicates that signal, mis-fires during legitimate quiet periods
(this demo has 30s cooldowns between subcalls and phases — see
INTRA_PHASE_COOLDOWN_SEC / INTER_PHASE_COOLDOWN_SEC in agent.yaml),
and forces every operator to tune cooldown-vs-detection-threshold.
REMOVED
- STALL_SECS env var and all its logic
- The 'half-window quiet hint' (only made sense alongside the watchdog)
- last_data_at and last_idle_hint state
- CURL_PID plumbing (no need to SIGTERM curl when there is no
watchdog to force-close it)
- mkfifo / background-curl dance in stream_sse — now a plain pipe
KEPT
- FLUSH_MS token-buffer flush cadence (50ms) — still real and useful,
it batches terminal writes so the renderer keeps pace with LLM emit
rate.
- All ANSI formatting, event-type rendering, event_id passthrough.
EOF flow (the only disconnect path now)
curl sees TCP close -> closes its stdout -> python's select() returns
ready -> os.read returns b'' -> renderer flush_tokens + break out of
while loop -> finally writes STATE_FILE -> bash sources state ->
STREAM_RESULT=disconnected (or 'complete' if we saw run_complete /
done first) -> _report_stream_result prints the right banner.
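The EOF-driven read loop can be sketched as follows. This is not the renderer embedded in demo-client.sh; the function name is hypothetical, it collects flushed batches instead of writing to a terminal, and select() on a pipe assumes a POSIX platform. It only illustrates the select() + os.read pattern with a FLUSH_MS-style batching window:

```python
import os
import select
import time


def read_until_eof(fd: int, flush_ms: int = 50) -> list:
    """Read fd until EOF, returning the batches in the order they flushed."""
    batches = []
    buffer = bytearray()
    last_flush = time.monotonic()
    while True:
        # Wake at least once per flush window even when no data arrives.
        ready, _, _ = select.select([fd], [], [], flush_ms / 1000)
        if ready:
            chunk = os.read(fd, 65536)
            if chunk == b"":
                break  # EOF: the writer closed, the only disconnect signal
            buffer += chunk
        elapsed_ms = (time.monotonic() - last_flush) * 1000
        if buffer and elapsed_ms >= flush_ms:
            batches.append(bytes(buffer))
            buffer.clear()
            last_flush = time.monotonic()
    if buffer:
        batches.append(bytes(buffer))  # final flush before reporting EOF
    return batches


# Usage: a pipe whose write end closes immediately simulates a server exit.
read_end, write_end = os.pipe()
os.write(write_end, b"tok1 tok2 ")
os.write(write_end, b"tok3")
os.close(write_end)
received = b"".join(read_until_eof(read_end))
os.close(read_end)
```

No timer decides whether the stream is dead: the loop ends only when os.read returns b'', so quiet cooldown periods of any length cannot trigger a false disconnect.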
VERIFIED locally
Happy path (clean close + run_complete):
wall=1.05s, STREAM_RESULT=complete ✓
Abrupt close (server emits 50 tokens then closes socket without
emitting done):
wall=1.04s (matches server timing exactly), STREAM_RESULT=disconnected,
no false 'stalled' warning ✓
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two user-reported issues, both addressed at the agent layer (no
framework changes):
1) The 30s cooldowns between subcalls / phases made the terminal go
silent — felt like nothing was happening.
2) Phase-level checkpointing meant the user had to wait ~5 min for
the first phase to finish before crash testing was meaningful
(else recovery just restarted phase 1 from scratch and the demo
looked like nothing happened).
CHANGES
agent.py — subcall-level checkpoints
- The handler now persists {in_progress_phase, completed_subcalls,
current_text} on top of the prior {completed_phases, results}
state. After each LLM subcall returns we flush to ctx.metadata.
- On recovery (ctx.entry_mode == 'recovered'), if we crashed
mid-phase we resume that same phase at the next un-finished
subcall, re-using the text we had already produced.
- Worst-case work lost on crash drops from ONE FULL PHASE (~3 min
+ 3 wasted LLM subcalls) to ONE SUBCALL (~30-60s + 1 LLM
subcall). Crash testing is now meaningful at any point in the
run, not just after a phase boundary.
- Phase-complete checkpoint additionally clears the in-progress
fields so the next phase starts cleanly.
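The resume logic described above can be sketched as follows. This is an assumption-laden illustration rather than the code in agent.py: a plain dict stands in for ctx.metadata, in-place mutation stands in for the flush, ctx.entry_mode is not modeled, and run_phases, llm_call and crash_after are hypothetical names.

```python
def run_phases(metadata, num_phases, calls_per_phase, llm_call, crash_after=None):
    """Run all phases, resuming from whatever state metadata already holds."""
    state = metadata.setdefault("state", {
        "completed_phases": 0,
        "results": [],
        "in_progress_phase": None,
        "completed_subcalls": 0,
        "current_text": "",
    })
    calls_made = 0
    for phase in range(state["completed_phases"], num_phases):
        if state["in_progress_phase"] != phase:
            # Fresh phase: no partial work to reuse.
            state.update(in_progress_phase=phase, completed_subcalls=0,
                         current_text="")
        for sub in range(state["completed_subcalls"], calls_per_phase):
            if crash_after is not None and calls_made == crash_after:
                raise RuntimeError("simulated crash")
            text = llm_call(phase, sub, state["current_text"])
            calls_made += 1
            # Subcall-level checkpoint: record progress after every call.
            state["current_text"] = text
            state["completed_subcalls"] = sub + 1
        # Phase-complete checkpoint: store the result, clear in-progress fields.
        state["results"].append(state["current_text"])
        state["completed_phases"] = phase + 1
        state.update(in_progress_phase=None, completed_subcalls=0,
                     current_text="")
    return state["results"]
```

Running this once with crash_after set and then again with the same metadata dict resumes at the next unfinished subcall, so no subcall is issued twice.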
agent.py — cooldown events
- New _cooldown(ctx, duration, stage, phase, subcall=, of=) helper
that emits a 'cooldown' SSE event before the asyncio sleep:
{type:cooldown,duration_sec:30,stage:intra_phase,
phase:2,total:15,subcall:3,of:4, ...}
- Replaces the bare asyncio.wait_for in both the intra-phase
(between subcalls) and inter-phase (between phases) cooldowns.
- The wait stays cancel-aware (steering / operator cancel still
short-circuit the cooldown).
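A cancel-aware cooldown of this shape can be sketched with asyncio alone. The signature below is an assumption: the real _cooldown helper takes ctx, while this version takes a hypothetical emit callback and an asyncio.Event standing in for the cancel signal, and it returns a boolean that the real helper may not.

```python
import asyncio


async def cooldown(emit, cancel, duration_sec, stage, phase, total,
                   subcall=None, of=None):
    """Emit a cooldown event, then wait unless cancel fires first.

    Returns True if the wait was cut short by cancel, False if the full
    duration elapsed. All parameter names here are illustrative.
    """
    event = {
        "type": "cooldown",
        "duration_sec": duration_sec,
        "stage": stage,
        "phase": phase,
        "total": total,
    }
    if subcall is not None:
        event.update(subcall=subcall, of=of)
    emit(event)  # the event goes out before the sleep starts
    try:
        # Waiting on the cancel event with a timeout doubles as the sleep.
        await asyncio.wait_for(cancel.wait(), timeout=duration_sec)
        return True
    except asyncio.TimeoutError:
        return False
```

Waiting on the event with a timeout, rather than calling asyncio.sleep, is what lets a steering input or operator cancel end the cooldown immediately.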
demo-client.sh — cooldown renderer
- Added a 'cooldown' case to the python renderer that prints a
single dim line, e.g.
[18:00:42Z] ...cooling down 30s (between subcalls) — next: subcall 3/4 in phase 2/15
- One line per cooldown, no spam.
README — updated the 'what the agent does' blurb to reflect:
- Checkpoints are now per-subcall (not per-phase).
- Cooldowns emit visible SSE events.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… heredoc)
Symptom (user-reported):
Traceback ... NameError: name 'duration_sec' is not defined
Root cause: my previous commit added the cooldown event renderer with
a Python string literal using single quotes:
evt.get('duration_sec', 0)
The single quotes prematurely terminated the surrounding single-quoted
bash string (_PY_RENDERER=apostrophe...apostrophe). Bash concatenated
the adjacent quoted segments and dropped the inner quotes, so the
runtime python source contained evt.get(duration_sec, 0) and raised a
NameError on the bare name duration_sec.
Fix
- Alias the dict key as a module-level constant _DSEC = "duration_sec"
(with double quotes, safe). Use evt.get(_DSEC, ...) at the call site.
- Add a CRITICAL header comment explaining the gotcha so future edits
do not reintroduce apostrophes. The header itself is reworded to
avoid using the literal character.
- Reword the inline NOTE comment for the same reason.
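The quoting gotcha and the fix can be reproduced without bash, using shlex from the Python standard library, which follows POSIX shell quoting rules for this case. BROKEN and FIXED are cut-down stand-ins for the real _PY_RENDERER assignment, and shell_value and run are hypothetical helpers.

```python
import shlex

# Inner apostrophes close and reopen the outer single-quoted string.
BROKEN = "_PY_RENDERER='d = evt.get('duration_sec', 0)'"
# Double quotes are literal inside a single-quoted shell string.
FIXED = "_PY_RENDERER='_DSEC = \"duration_sec\"; d = evt.get(_DSEC, 0)'"


def shell_value(assignment):
    """Return the value a POSIX shell would assign for NAME='...'."""
    (token,) = shlex.split(assignment)
    return token.split("=", 1)[1]


def run(source):
    """Execute the extracted python source against a sample event."""
    scope = {"evt": {"duration_sec": 30}}
    exec(source, scope)
    return scope["d"]
```

shell_value(BROKEN) comes back with the quotes around duration_sec gone, so run() on it raises NameError, while the FIXED variant survives the shell intact and reads the value from the event.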
Verified
- bash -n parses
- python ast.parse on the extracted heredoc parses
- Functional smoke: phase_end and cooldown events render correctly,
duration_sec extracts and formats as expected.
No server-side change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ints + cooldown events)
Captures the v31 deploy that ships the subcall-level checkpointing and
cooldown-event emission from commit 2925f1d.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary — TEMPORARY / DO NOT MERGE
This is the durable-agent-demo split out of the original spec 016
durability PR (#46997). It carries the azd-deployable hosted-agent
demo (34 files: bicep infra, .azure azd state,
src/durable-research-agent agent code, build/demo-client scripts).
Status
🚨 This PR is not intended for merge. The demo lives here purely so
it isn't lost from the working set; we use it temporarily as a
reference deployment while the durable-task primitive matures.
Scope
sdk/agentserver/azure-ai-agentserver-invocations/samples/durable-agent-demo/
only (34 files), plus whatever else came from the original split-point
branch; see the next section for the cleanup needed.
What this branch needs before any potential reuse
durable-agent-demo/ directory (everything else should be discarded by
reverting to origin/main for those paths, since the core+invocations
work belongs in PR #46997, "feat(agentserver): light up durable-task
primitive (core 2.0.0b6 + invocations 1.0.0b5)", and the responses work
belongs in PR #47275, "feat(agentserver): WIP - responses package
durable orchestration (split out of #46997)").
Pointers
The distilled invocations sample (samples/durable_research) derived
from this demo is shipping in PR #46997, "feat(agentserver): light up
durable-task primitive (core 2.0.0b6 + invocations 1.0.0b5)"; the demo
here remains as the fuller azd-deployable reference.