
demo(agentserver): TEMPORARY - durable-agent-demo (split out of #46997, never-merged) #47276

Draft
RaviPidaparthi wants to merge 30 commits into
feature/agentserver-durable-tasks from
feature/agentserver-durable-agent-demo

Conversation

@RaviPidaparthi
Member

Summary — TEMPORARY / DO NOT MERGE

This is the durable-agent-demo split out of the original spec 016
durability PR (#46997). It carries the azd-deployable hosted-agent
demo (34 files: bicep infra, .azure azd state, src/durable-research-agent
agent code, build/demo-client scripts).

Status

🚨 This PR is not intended for merge. The demo lives here purely so
it isn't lost from the working set; we use it temporarily as a
reference deployment while the durable-task primitive matures.

Scope

sdk/agentserver/azure-ai-agentserver-invocations/samples/durable-agent-demo/
only (34 files), plus whatever else came from the original split-point
branch; see the next section for the cleanup needed.

What this branch needs before any potential reuse

Pointers

@github-actions github-actions Bot added the Hosted Agents sdk/agentserver/* label Jun 2, 2026
This commit restores the azd-deployable durable-agent-demo (34 files)
that was moved out of the core PR (#46997) to keep scope manageable.
Sits on top of the core PR branch so it only shows the demo delta.

🚨 TEMPORARY — this PR is NOT intended for merge. The demo lives
here purely so it isn't lost from the working set; we use it as a
reference deployment while the durable-task primitive matures. The
distilled invocations sample (samples/durable_research) derived from
this demo ships in PR #46997 instead.

Restored from safety-spec016-backup-2026-06-02 (SHA 3df9c5b).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@RaviPidaparthi RaviPidaparthi changed the base branch from main to feature/agentserver-durable-tasks June 2, 2026 03:33
RaviPidaparthi and others added 11 commits June 2, 2026 03:46
…16 core

Three call-site updates to align the demo with the spec 016 public surface:

1. Drop TaskTerminated from imports (it was removed from the public
   surface — TaskCancelled now covers cooperative-cancel paths).

2. Drop session_id= from deep_research.start() — session is platform-
   derived from FOUNDRY_AGENT_SESSION_ID, not a per-call argument.

3. Await deep_research.get_active_run(task_id) — it's now an async
   method (the framework needs to consult the task store, not just
   in-memory state) so the previous synchronous call returned a
   coroutine, not a TaskRun.
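
A minimal sketch of why item 3 matters. FakeTask is a hypothetical stand-in, not the SDK class; it only shows that calling an async method without await yields a coroutine object rather than the run.

```python
import asyncio
import inspect


class FakeTask:
    """Hypothetical stand-in for the demo's task object; not the real SDK class."""

    async def get_active_run(self, task_id: str) -> dict:
        # The real method consults the task store; here we only simulate I/O.
        await asyncio.sleep(0)
        return {"task_id": task_id, "status": "running"}


async def main():
    task = FakeTask()

    # Old call site: no await, so we hold a coroutine object, not a run.
    not_awaited = task.get_active_run("session-1")
    was_coroutine = inspect.iscoroutine(not_awaited)
    not_awaited.close()  # avoid a "never awaited" RuntimeWarning

    # Updated call site: await yields the actual value.
    run = await task.get_active_run("session-1")
    return was_coroutine, run


was_coroutine, run = asyncio.run(main())
print(was_coroutine, run["status"])
```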

Also refreshes the bundled wheels (b4 -> core 2.0.0b6 + invocations
1.0.0b5) and the azd env state from a fresh 'azd up' deployment
against the e2e-tests-westus2 Foundry project.

Verified end-to-end against the deployed agent:
  ./demo-client.sh start "durable tasks demonstration"
  -> streams stages 1/12, 2/12, ... live via SSE
  ./demo-client.sh crash
  -> {"status":"crashing"}; supervisor restarts the container

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…story

The platform now provides two capabilities that obsolete our application-level
infrastructure:
  * nanny worker restarts the container within ~5-10 min of a crash
  * lease-renewal on @task internally pings /readiness so the sandbox stays
    alive as long as a durable task is executing (no need for client traffic)

This commit rewrites the durable-research-agent demo around those guarantees
and adds steering as a third headline capability.

Removed:
  * supervisor.py (170 lines) — the PID-1 reverse proxy + restart loop
  * entrypoint.sh — the auto-restart bash wrapper
  * aiohttp production dep (only supervisor used it)
  * Dockerfile FOUNDRY_TASK_API_ENABLED (auto-selected now per dev guide)
  * Multi-shape input parsing (per feedback: it's a demo; stick to one shape)

Container now runs 'python app.py' directly; CMD changed accordingly and
azure.yaml startupCommand updated to match.

agent.py rewrite (~221 lines):
  * 15 phases x 4 LLM sub-calls (research -> critique -> refine -> synthesize)
    targeting ~45 min total wall time (gpt-4.1-mini, 1500 output tokens/call).
    Env-overridable: NUM_PHASES, CALLS_PER_PHASE, TARGET_OUTPUT_TOKENS,
    INTRA_PHASE_COOLDOWN_SEC, INTER_PHASE_COOLDOWN_SEC.
  * Every phase emits phase_start + phase_end events with server_time_utc
    (UTC ISO8601 with ms) and server_uptime_sec. The uptime resets to ~0
    after the platform nanny restarts the container — so a viewer can
    SEE the crash recovery happen in the stream.
  * @task(steerable=True). On every checkpoint boundary the handler
    checks ctx.cancel.is_set() and (when pending_input_count > 0)
    emits a winding_down event with cause + returns ctx.suspend(). The
    framework drains the next steering input as a fresh turn.
  * Topic-change detection at handler entry resets checkpoint state when
    the steered topic differs from the previously stored one.
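
The checkpoint loop described above can be sketched roughly as follows. FakeContext, run_phases and the state keys are hypothetical names for illustration; the real handler uses the framework's context object and ctx.suspend(), which are not reproduced here.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class FakeContext:
    """Hypothetical stand-in for the task context; mimics only what the sketch needs."""
    cancel: threading.Event = field(default_factory=threading.Event)
    state: dict = field(default_factory=dict)   # persisted checkpoint state
    events: list = field(default_factory=list)  # emitted stream events


def run_phases(ctx: FakeContext, topic: str, num_phases: int = 3) -> str:
    # Topic-change detection: a different topic resets checkpoint progress.
    if ctx.state.get("topic") != topic:
        ctx.state.clear()
        ctx.state["topic"] = topic
        ctx.state["completed_phases"] = 0

    for phase in range(ctx.state["completed_phases"] + 1, num_phases + 1):
        # Checkpoint boundary: honour a cooperative cancel before starting work.
        if ctx.cancel.is_set():
            ctx.events.append({"type": "winding_down", "at_phase": phase})
            return "suspended"
        ctx.events.append({"type": "phase_start", "phase": phase})
        # ... the LLM sub-calls for this phase would run here ...
        ctx.events.append({"type": "phase_end", "phase": phase})
        ctx.state["completed_phases"] = phase  # checkpoint after each phase
    return "complete"
```

Because progress is stored per phase, a restarted process that calls run_phases with the same topic would continue from completed_phases + 1 instead of starting over.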

app.py rewrite:
  * task_id = session_id (was invocation_id) so steering routes correctly:
    second POST on the same session hits the active task and queues input.
  * POST /invocations with {"message": "crash"} (DEMO_MODE=1) exits the
    process so the platform nanny restarts the container. The platform
    only proxies /invocations* — we can't add custom routes.
  * GET /invocations/{id} falls back to file replay when no live run is
    present, so reconnecting after the task completes still shows the
    full transcript (regression fix per review feedback).
  * session_id read from app.config.session_id when not on request.state
    (GET state doesn't carry it).

demo-client.sh rewrite:
  * Pretty SSE renderer that recognises the new event types (run_start,
    recovered, phase_start/end, subcall_start/end, winding_down,
    run_complete) and box-prints the timestamps.
  * Commands: start, stream (reconnect), steer, crash, cancel, status,
    logs, reset. No more auto-reconnect spam — disconnects suggest manual
    reconnect, matching the long-run / no-ingress demo flow.
  * Three-terminal demo workflow documented in --help and README.

README rewrite:
  * Documents the three capabilities (long-running > 15 min, crash
    recovery via platform nanny ~5-10 min, steering).
  * New A/B/C demo walkthroughs (long-run no-ingress, crash-recovery,
    steering).
  * Architecture diagram drops supervisor; lists the platform-managed
    behaviors (nanny, lease renewal) explicitly.
  * Env var table reflects new tuning knobs.

Verified end-to-end against the deployed agent
(e2e-tests-westus2 / durable-research-agent):
  * Streaming with timestamps  -> all event types render correctly
  * Steering                   -> 'Steering drain: task drained next input'
                                  in server logs; new turn runs new topic
  * GET after completion       -> HTTP 200 + SSE file replay
  * Crash dispatch             -> POST returns 202; next POST gets 424
                                  (container down, awaiting nanny restart)

Refreshed wheels (core 2.0.0b6 + invocations 1.0.0b5) and azd env state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… detection

Two changes after platform-behavior verification testing:

1. README / app.py / demo-client.sh: correct the platform-restart story.
   Earlier docs said 'platform nanny restarts the container within
   ~5-10 minutes' — autonomous. Empirical observation during testing
   showed the actual behavior is **ingress-triggered**:

     * Crashed containers stay down with NO ingress.
     * The next inbound request triggers the platform to bring the
       container back, which happens in ~10 seconds (much faster than
       the 5-10 min worst-case figure).
     * The durable task then auto-recovers from its last checkpoint.

   Verified by waiting 16 min after crash with zero ingress, then
   reconnecting and observing the container started 11 sec AFTER my
   reconnect GET (server logs: 'AgentServerHost started 21:03:18' for
   reconnect at 21:03:07). User-facing experience is unchanged: any
   reconnect attempt seamlessly restores the task.

   (The lease-renewal-keeps-sandbox-alive story is also verified —
   Test B showed phases progressing from uptime 47s -> 569s linearly
   with no resets during a 9.5-min no-ingress window. The framework's
   internal lease-renewal cycle ingresses /readiness internally, which
   keeps the sandbox alive while the @task is executing.)

2. agent.py _wind_down(): change cause detection to use exclusion.
   ctx.pending_input_count is often back to 0 by the time the wind-down
   triggers (the framework drained the steering input before we
   observed). Detect by elimination instead: if neither timeout nor
   operator_cancel, it must be steering. Removes the bogus 'unknown'
   cause we saw in steer-test output.
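
Detection by elimination can be sketched as below. The function name and its boolean parameters are hypothetical; the actual _wind_down() signature is not shown in this PR.

```python
def wind_down_cause(timed_out: bool, operator_cancel: bool) -> str:
    """Classify why a wind-down was requested (hypothetical helper)."""
    if timed_out:
        return "timeout"
    if operator_cancel:
        return "operator_cancel"
    # Neither known cause applies, so by elimination this is steering.
    # pending_input_count is deliberately not consulted: the framework may
    # already have drained the steering input, leaving the count at 0.
    return "steering"
```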

Verified end-to-end against the deployed agent:
  Test A (crash recovery, no-ingress):
    20:42:32 dispatched 'carbon capture technology'
    20:46:52 crash (after 4 phases done, uptime 205s)
    20:46:52 + 16 min: NO ingress
    21:03:07 reconnected with GET
    21:03:18 container started (uptime 1.3s), task recovered from
             phase 5 checkpoint, resumed at phase 6
  Test B (lease keeps sandbox alive):
    20:23:59 dispatched 'supply chain resilience'
    20:23:59 + 17 min: NO ingress
    During wait: phases 1-10 all completed; uptime grew 1.9s -> 569s
                 linearly (no restarts during the no-ingress window)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removed the 'sandbox stays alive while @task executes' claim from the
README's headline capabilities — empirical testing showed this is NOT
happening on the current platform deployment.

What we verified:
  * Container is reclaimed at exactly the 15-min mark since the last
    user-facing ingress, regardless of whether a @task handler is
    actively running.
  * Framework's lease-renewal cycle goes to the task-store API
    (PATCH /api/projects/.../tasks/{id}), NOT to the agent container's
    /readiness endpoint. So lease renewal doesn't reset the platform's
    idle timer.
  * Crashed/reclaimed containers stay down with zero ingress.
  * The next ingress request brings the container back in ~10 sec.
  * Durable task auto-resumes from last checkpoint (entry_mode='recovered',
    correct completed_phases).

README now describes the demo as two capabilities:
  1. Crash + idle recovery — any reconnect after a crash or 15-min
     idle reclaim seamlessly resumes from the last checkpoint
  2. Steering — mid-run topic switch via cooperative wind-down

Long-running tasks DO complete (just by being reclaimed-and-recovered
repeatedly rather than running uninterrupted on a single container).
Section A reframed accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three issues from review:

1. Session-id lifecycle was unclear. Added a 'Session-id lifecycle'
   subsection explaining that 'start' allocates a new UUID and writes
   it to .demo-session; stream/steer/crash/cancel/logs/status all reuse
   it; 'reset' clears it so the next 'start' allocates a fresh one.

2. Log inspection wasn't documented. Added an 'Inspecting container
   logs' subsection that points to './demo-client.sh logs' and 'azd ai
   agent monitor', and enumerates the most useful framework log lines
   (TaskManager starting, Reclaimed/Recovered task, /readiness probe,
   OpenAI HTTP requests, Steering drain).

3. Architecture diagram was stale. Removed 'POST /demo/crash' (no such
   route — platform only proxies /invocations*), removed the false
   'lease renewal pings /readiness' callout, and added a clearer
   diagram that shows the Foundry control plane separately and calls
   out the actual mechanisms (lease renewal goes to task-storage API,
   /readiness is hit only by platform startup probe, container revival
   is ingress-triggered).
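
The session-id lifecycle from item 1 can be sketched in Python as below. demo-client.sh itself is a bash script, so the function names here are illustrative only; the .demo-session file name comes from the description above.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def start(session_file: Path) -> str:
    """Allocate a fresh session id and persist it for later commands."""
    session_id = str(uuid.uuid4())
    session_file.write_text(session_id, encoding="utf-8")
    return session_id


def current(session_file: Path) -> Optional[str]:
    """Return the stored session id, or None when no session is active."""
    if not session_file.exists():
        return None
    return session_file.read_text(encoding="utf-8").strip()


def reset(session_file: Path) -> None:
    """Forget the stored id so the next start() allocates a new one."""
    if session_file.exists():
        session_file.unlink()


with tempfile.TemporaryDirectory() as tmp:
    session_file = Path(tmp) / ".demo-session"
    first = start(session_file)
    reused = current(session_file)   # stream/steer/crash/... would read this
    reset(session_file)
    after_reset = current(session_file)
    second = start(session_file)
```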

Also added an upfront command-reference table covering every
demo-client.sh subcommand, and fixed the env-var doc to reflect that
overrides happen via Dockerfile/azure.yaml (the container runs the
shipped image, not a local python app.py).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Picks up the new TaskRun.__await__ method from the core branch
(merged in). With this, callers of get_active_run / start can await
the returned TaskRun directly to get the TaskResult, removing a
pyright squiggle on:

    run = await deep_research.get_active_run(task_id)

No changes to the demo's app.py or agent.py — they already use the
correct pattern. This is purely refreshing the bundled wheels so the
deployed agent picks up the new core build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r imports

The demo dir lives under the invocations package which has a
pyrightconfig.json that excludes samples/** but still applies its
rules to opened files. When the IDE opens app.py / agent.py, it
couldn't find the editable-installed agentserver packages without an
explicit venvPath / venv setting.

Adding a demo-local pyrightconfig.json that:
  * points venv at the repo's .venv (via the relative path)
  * suppresses reportMissingImports / reportAttributeAccessIssue
    (the in-tree editable install is enough; the imports work; we
    don't need warnings telling us otherwise on a demo)
  * keeps the meaningful checks (Optional access, argument type,
    general type issues, return type)

Verified: pyright runs clean from the demo dir with this config
(0 errors, 1 informational warning on .output Optional access).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
I added a demo-local pyrightconfig.json earlier in this session to
work around an IDE squiggle. Root cause was much simpler: the venv
just had an OLD wheel (2.0.0b4) cached from way back. Reinstalling
the new 2.0.0b6 wheel (which has TaskRun.__await__) in the venv
makes everything resolve correctly without any pyright config
changes — the IDE was working fine before; this restores that.

Reinstall command:
  pip uninstall -y azure-ai-agentserver-core azure-ai-agentserver-invocations
  pip install sdk/agentserver/azure-ai-agentserver-core \
              sdk/agentserver/azure-ai-agentserver-invocations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@RaviPidaparthi RaviPidaparthi force-pushed the feature/agentserver-durable-agent-demo branch from 756e0fe to 2553746 Compare June 2, 2026 22:50
RaviPidaparthi and others added 15 commits June 2, 2026 22:52
The previous attempt to set FOUNDRY_TASK_API_ENABLED was rejected by
the hosting platform (FOUNDRY_*/AGENT_* are reserved namespaces). Core
has been updated to use AGENTSERVER_TASK_API_ENABLED instead — apply
that here and refresh the bundled wheels.

Effect: the demo container now uses HostedTaskProvider, so /tasks HTTP
calls (lease renewals, readiness pings, state PATCHes) flow through
the TaskApiLoggingPolicy and show up in 'demo-client.sh logs' as
'task-store request: ...' lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… validation

Captures the v25 deploy that exercised the lease-renewal + nanny-restore
validation:

  Test 1 — lease keeps sandbox alive >15 min without client ingress: PASS
    Same lease_instance_id for 46+ min, 12 phases completed, only platform
    /liveness probes and our framework's PATCH .../tasks/<id> lease
    renewals (every ~30s) kept the sandbox warm.

  Test 2 — nanny restores crashed sandbox within ~15 min, zero ingress: PASS
    Crashed at 00:02:04Z; new worker came up at 00:02:47Z (43s later);
    durable task auto-resumed with entry_mode='recovered' from the last
    checkpoint (completed_phases: 2); progressed through 4 more phases
    with no client ingress.

agent.yaml is back to default cooldowns (10/20s); only the
AGENTSERVER_TASK_API_ENABLED=1 opt-in is retained (committed earlier
in 1b1e334).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… validated behavior

Sets hosted-mode INTRA_PHASE_COOLDOWN_SEC=30 and INTER_PHASE_COOLDOWN_SEC=30
in agent.yaml so the deployed durable-research-agent runs for ~33 min
(15 phases × (~12s LLM + 3×30s intra + 30s inter)). The run intentionally
exceeds the platform's 15-min sandbox-eviction window so each demo run
exercises the framework's lease-renewal keep-alive path end-to-end —
which is the whole point of @task durability and what we just validated
empirically against e2e-tests-westus2.

agent.py defaults (10/20s = ~15 min) are kept for local/dev iteration
where the long wall-time isn't useful.
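
The wall-time figures above follow from simple arithmetic. This sketch assumes 4 sub-calls per phase (so 3 intra-phase cooldowns) and treats the ~12s LLM figure as the total per phase, as the formula in this message does; the function name is illustrative.

```python
def estimated_run_seconds(phases, calls_per_phase, llm_sec_per_phase,
                          intra_sec, inter_sec):
    # Each phase has (calls_per_phase - 1) intra-phase cooldowns
    # plus one inter-phase cooldown.
    per_phase = llm_sec_per_phase + (calls_per_phase - 1) * intra_sec + inter_sec
    return phases * per_phase


hosted = estimated_run_seconds(15, 4, 12, intra_sec=30, inter_sec=30)
local = estimated_run_seconds(15, 4, 12, intra_sec=10, inter_sec=20)
print(hosted / 60, local / 60)  # 33.0 15.5
```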

README updates reflect what we proved (rather than what we previously
assumed):
- Recovery section now leads with 'long-running tasks survive past 15 min
  via lease-renewal keep-alive' as a first-class platform capability,
  not buried in a doubt-laden footnote.
- Removed the 'Note on long-running tasks' disclaimer that claimed
  lease renewals do NOT extend the idle window — empirical evidence
  shows otherwise (Test 1: 46-min uptime, same instance throughout,
  zero client ingress after T=0).
- Workflow A retitled 'Long-running run with no client-side keepalive'
  and rewritten to reflect: reconnecting after 25 min finds the SAME
  instance, not a recovered fresh one.
- Workflow B (crash) reflects the nanny does the restore on its own
  within ~1 min — no client ingress required to bring the container
  back; the durable task auto-resumes inside the new process.
- Architecture diagram's 'Idle-reclaim timer' note now explains it is
  kept fresh by framework lease-renewal traffic.
- Env-var table now lists hosted vs agent.py defaults separately and
  includes AGENTSERVER_TASK_API_ENABLED with explanation.
- Fast-dev-loop block now points at agent.yaml (not the Dockerfile)
  since env vars live in agent.yaml now.

azd state synced to the v26 deploy that ships these settings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…renderer + add client wall-clock

Core branch now auto-enables HostedTaskProvider in hosted environments,
so this demo no longer needs AGENTSERVER_TASK_API_ENABLED. Likewise,
wheels are now built centrally via sdk/agentserver/scripts/build-wheels.sh
and staged into the docker build context — no committed wheels.

CHANGES

  agent.yaml
    - Drop AGENTSERVER_TASK_API_ENABLED (auto-on in hosted).
    - Tighten the cooldown comment (no behavior change).

  build.sh
    - Delegate to the central sdk/agentserver/scripts/build-wheels.sh.
    - Stage wheels into src/durable-research-agent/wheels/ (gitignored
      docker-build dir), so the Dockerfile's COPY wheels/ ... still
      finds them at build time.
    - Per-sample build.sh is now a thin staging wrapper; no per-sample
      duplication of the build logic.

  src/durable-research-agent/wheels/*.whl  (deleted)
    - Wheels are no longer committed. They're regenerated on demand.

  app.py — fix file_replay SSE double-encoding
    - FileStreamHandler.put writes json.dumps(item)+'\n', where item
      is itself a JSON string from ctx.stream(json.dumps({...})). The
      live_stream path correctly reads from the in-memory queue (which
      holds the original string). The file_replay path read the disk
      line via json.loads, then RE-WRAPPED with json.dumps before
      embedding in 'data: ...\n\n' — producing
        data: "{\"type\": \"...\"}"
      which the client rendered as '[unknown event] "{\"...\"}"'.
    - Decode once, embed the raw JSON string directly. Also add an
      isinstance check before the __done__ key lookup (the decoded
      value is a string for normal events).
    - Update crash-handler 202 response message + docstring to reflect
      validated behavior (nanny restores ~1 min, no ingress needed).

  demo-client.sh
    - Add _now_utc() helper and prefix every block-style event with
      '[HH:MM:SSZ]' — the client's local UTC wall-clock at render
      time — so users can compare against server_time= (server-side
      UTC) and uptime= (server process seconds-since-boot) for a
      clear timeline of phases vs lease renewals vs recoveries.
    - Update header comment: drop the wrong '~5-10 min' nanny restore
      and the wrong 'lease renewal pings readiness' phrasing; reflect
      the validated 30s lease cadence and ~1 min nanny window.
    - Three-terminal usage example: ~33 min (not 45) wall-time per
      run; nanny restores ~1 min after crash (no need to send any
      ingress to trigger recovery).
    - Crash-command output text: nanny brings container back on its
      own, no client action required.

  README.md
    - Capability #1 reframed: lease keep-alive proven end-to-end
      (e2e-tests-westus2), 33-min runs with zero client ingress.
    - Capability #2 reframed: nanny restores within ~1 min (43s
      measured) without any client ingress; recover-on-reconnect was
      a misread of the old behavior.
    - Deploy section: build.sh now delegates to the central script;
      points at USING_PRE_RELEASE_WHEELS.md for the wheel workflow.
    - Crash command row in the command-reference table: clearer wording
      around nanny-driven recovery.
    - Env-var table: drop AGENTSERVER_TASK_API_ENABLED row (gone);
      add a paragraph clarifying that hosted/local provider selection
      is automatic.
    - File-structure section: build.sh and wheels/ entries reflect the
      new layout; add pointer to the wheel-distribution doc.
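
The file_replay double-encoding fix listed under app.py can be reproduced with the standard json module alone. The event payload, parse_sse_data and is_done are made up for illustration, and the dict shape assumed for the __done__ sentinel is a guess, not taken from the demo.

```python
import json

event = {"type": "phase_start", "phase": 1}
item = json.dumps(event)              # the string passed to ctx.stream(...)
disk_line = json.dumps(item) + "\n"   # what the file handler writes to disk

decoded = json.loads(disk_line)       # back to the original JSON string

buggy_frame = f"data: {json.dumps(decoded)}\n\n"  # re-wrapped: double-encoded
fixed_frame = f"data: {decoded}\n\n"              # decode once, embed as-is


def parse_sse_data(frame: str):
    """Parse the JSON payload of a single-line SSE data frame."""
    return json.loads(frame[len("data: "):].strip())


def is_done(value) -> bool:
    # Guard with isinstance first: normal events decode to a str, so a
    # key lookup on them would be wrong. (Sentinel shape is an assumption.)
    return isinstance(value, dict) and "__done__" in value


print(type(parse_sse_data(buggy_frame)).__name__)  # str
print(type(parse_sse_data(fixed_frame)).__name__)  # dict
```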

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…plify build.sh to copy-only

Merges the core branch's three corrections:
  - Skill moved out of .github/skills/ into sdk/agentserver/docs/
    (standalone artifact, devs copy independently).
  - @task preview wheels checked into sdk/agentserver/wheels/.
  - USING_PRE_RELEASE_WHEELS.md framing fixed (packages ARE on PyPI;
    @task primitive is private preview).

Demo-specific changes that follow from the above:

  build.sh
    - No longer invokes sdk/agentserver/scripts/build-wheels.sh.
    - Just copies the checked-in central wheels into the per-sample
      gitignored docker-build staging dir. Faster, no compilation.

  README.md
    - Deploy section: 'stage the checked-in @task preview wheels' (not
      'build agentserver wheels'). Adds a note that @task is private
      preview and the wheels are how you get it.
    - File-structure blurb: matches the new copy-only build.sh.

  .gitignore
    - Merged the demo-local Docker-staging entry with the existing
      .azure / .demo-session entries from this branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eel docs

Following the core-branch reorganization that moved
sdk/agentserver/docs/USING_PRE_RELEASE_WHEELS.md → sdk/agentserver/wheels/README.md,
update the demo's links and a build.sh comment to the new path.

No behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PROBLEM
The demo's previous live_stream tracked event_id with a per-invocation
counter (event_id starts at 0 on each new GET). Combined with the
single-consumer queue contract, a client reconnect with
?last_event_id=N could not deterministically resume — the meaning of
event_id N depended on the queue's current state, not the actual
emission position.

Concretely observed: with last_event_id=8092 on a long-running task,
a reconnect landed at phase 8's mid-content (not the next event after
8092) because (a) prior consumers had dequeued items the new GET could
not see, and (b) the new live_stream counted from 1 again, advancing
through whatever was currently in the queue.

FIX (smallest possible)

1. FileStreamHandler now tracks a single _next_event_id counter
   incremented on every disk-line append — preload from disk on
   __init__, normal put, and the __done__ sentinel in close. Items go
   onto the queue as (event_id, item) tuples instead of bare items.
   event_id == disk row number == durable across restarts, recovery,
   and consumers.

2. app.py live_stream unpacks (event_id, chunk) tuples and uses the
   durable event_id directly when forming the SSE 'id: N' header.
   skip_count semantics are now correct: items with event_id <=
   skip_count are skipped; the rest are emitted with their durable id.

3. Defensive non-tuple unpack path keeps the GET handler safe if the
   FileStreamHandler is ever swapped for a stock QueueStreamHandler
   that emits bare items.
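
A minimal sketch of the durable-id scheme, assuming one JSON line per event on disk. DurableIdStream and this live_stream are hypothetical simplifications, not the demo's FileStreamHandler or GET handler; in particular, re-queueing preloaded rows is an assumption.

```python
import json
import os
import queue
import tempfile


class DurableIdStream:
    """Hypothetical sketch: event_id equals the 1-based disk row number."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.queue: queue.Queue = queue.Queue()
        self._next_event_id = 1
        # Preload: rows already on disk keep their ids across restarts.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    self.queue.put((self._next_event_id, json.loads(line)))
                    self._next_event_id += 1

    def put(self, item: str) -> int:
        event_id = self._next_event_id
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(item) + "\n")
        self.queue.put((event_id, item))   # tuple, not a bare item
        self._next_event_id += 1
        return event_id


def live_stream(handler: DurableIdStream, skip_count: int = 0) -> list:
    frames = []
    while not handler.queue.empty():
        entry = handler.queue.get()
        # Defensive path: tolerate bare items from a handler without ids.
        event_id, chunk = entry if isinstance(entry, tuple) else (None, entry)
        if event_id is not None and event_id <= skip_count:
            continue
        id_line = f"id: {event_id}\n" if event_id is not None else ""
        frames.append(f"{id_line}data: {chunk}\n\n")
    return frames


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stream.jsonl")
    first = DurableIdStream(path)
    for text in ("a", "b", "c"):
        first.put(text)
    resumed = live_stream(first, skip_count=1)   # like ?last_event_id=1
    restarted = DurableIdStream(path)            # simulated process restart
    next_id = restarted.put("d")
```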

ACCEPTED LIMITATION
If a prior consumer has drained items the new GET expected to see,
those items are simply not emitted (queue is single-consumer per the
framework's StreamHandler contract — there's no way to backfill from
disk without a larger refactor). Per user direction: 'one or two delta
misses are acceptable; just be graceful.' We achieve that — the new
GET emits whatever is currently in the queue and resumes cleanly from
there.

SMOKE TEST RESULT (v32 deploy)
- Fresh GET: ids 1..1973 ✓
- Resume last_event_id=1973: starts at 1974, exact continuation ✓
- Resume last_event_id=10 after drain: starts at 2011 (gap skipped
  gracefully, no error, monotonic forward progress) ✓
- Drain to 2978 then resume from 1489: starts at 2979 (graceful gap
  skip, ids strictly monotonic) ✓

file_replay path already used disk-line counting — no change needed
there; live_stream and file_replay now agree on the event_id space.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ithin seconds, not minutes

PROBLEM
User reported: after issuing ./demo-client.sh crash, the SSE stream
on the original terminal kept showing events for *minutes* before the
disconnect surfaced. This was not a server, proxy, or TCP buffering
issue — it was the demo-client renderer itself building a backlog.

ROOT CAUSE
Each rendered event was spawning python3 subprocesses:

  * etype detection — 1 python3 per event (~30ms)
  * _now_utc()       — 1 'date' subprocess per event (~5ms)
  * Token content    — 1 python3 per token (~30ms)

For the token hot path that meant ~65ms per token. LLMs emit at
50-100 tok/s, so the renderer was running at ~10% of the server's
emit rate. The kernel TCP buffer + curl + bash pipe accumulated a
backlog that grew ~9 seconds per second of LLM streaming. When the
server crashed, that backlog still had to drain through the slow
renderer before the EOF on curl reached the bash 'while read' loop.

Measured before:
   100 token renders = 9.7s
  1000 token renders = 51s
  5000 token renders = timed out at 90s

FIX (minimal, no behavior change)
- etype detection: bash regex on the JSON instead of python3.
- _now_utc(): moved from top-of-render_event into only the cases
  that actually use it (token + subcall_end don't need wall-clock).
- Token content extraction: bash regex + parameter-expansion
  unescape for the five common JSON escapes (\\, \", \n, \t,
  \r). A literal \uXXXX in a token would print as the raw escape;
  that's acceptable for a demo.

Measured after:
  5000 token renders = 1.17s   (~0.23ms per token, ~220x faster)
  phase_start render = 253ms   (still uses _jq; happens 1/3min)

Effect: renderer is now ~50x faster than the LLM emit rate, so no
backlog builds. When the server crashes the client sees EOF within
its normal poll interval and surfaces the disconnect within seconds.
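
The headline ratios can be rechecked from the measurements quoted above. This is plain arithmetic on those numbers, assuming the 1000-token run is the "before" baseline for the ~220x figure.

```python
before_ms_per_token = 51_000 / 1000   # 1000 token renders took 51 s
after_ms_per_token = 1_170 / 5000     # 5000 token renders took 1.17 s
speedup = before_ms_per_token / after_ms_per_token

render_rate = 5000 / 1.17             # tokens per second after the fix
headroom_at_100 = render_rate / 100   # versus an LLM emitting 100 tok/s
headroom_at_50 = render_rate / 50     # versus an LLM emitting 50 tok/s

print(round(after_ms_per_token, 3), round(speedup))
print(round(headroom_at_100), round(headroom_at_50))
```

The renderer headroom works out to roughly 43x to 85x depending on the emit rate, which brackets the ~50x figure.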

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…atchdog

PROBLEM
After the previous renderer-speedup, user reported 20-30s latency
between issuing a crash and seeing the stream disconnect, even early
in phase 1 — and that the latency seems to grow with longer streams.

INVESTIGATION
Built a localhost SSE server + bash client loop and measured. The
bash renderer is actually fast enough (3300 tok/s drain, 12ms
post-EOF detect on a clean close). So the residual latency is NOT in
the bash hot path. Two likely causes left:
  1. The platform edge proxy between the server container and the
     client buffers SSE responses and may hold the TCP connection
     open after the backend dies — there is no client-side way to
     speed up the EOF in this case.
  2. printf-per-token to a real interactive terminal (vs the
     /dev/null benchmark) has per-call overhead the renderer cannot
     amortize.

FIX
Replace the bash 'while read | render_event' loop with a single
long-lived python renderer. python is fundamentally better-suited
for line-rate streaming with batching:

  - In-memory token buffer flushed every ~50ms instead of a
    printf-per-token (~20x fewer terminal syscalls in steady state).
  - select() + idle-timer in one loop: tokens batch under load,
    block events render immediately, and an idle watchdog fires
    after STALL_SECS of no inbound data.
  - When the watchdog fires the renderer SIGTERMs curl (its PID is
    passed via env var) so the bash pipeline exits within a couple
    hundred ms of the warning, regardless of whether the platform
    proxy is still holding the socket open.
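
A rough sketch of the batching idea only (no watchdog, no colors), assuming a Unix-like system where select() works on pipes. The event shape {"type": "token", "content": ...} and the render function are assumptions for illustration, not the embedded _PY_RENDERER.

```python
import json
import os
import select
import time


def render(fd: int, write, flush_ms: int = 50) -> None:
    """Batch token text; render any other event immediately."""
    pending = b""
    tokens = []
    last_flush = time.monotonic()

    def flush_tokens() -> None:
        nonlocal last_flush
        if tokens:
            write("".join(tokens))   # one terminal write for many tokens
            tokens.clear()
        last_flush = time.monotonic()

    while True:
        ready, _, _ = select.select([fd], [], [], flush_ms / 1000)
        if ready:
            chunk = os.read(fd, 65536)
            if chunk == b"":         # EOF: the writer closed the pipe
                break
            pending += chunk
            *lines, pending = pending.split(b"\n")
            for raw in lines:
                if not raw.startswith(b"data: "):
                    continue
                event = json.loads(raw[len(b"data: "):])
                if event.get("type") == "token":
                    tokens.append(event["content"])
                else:
                    flush_tokens()   # keep ordering: tokens before the block
                    write(f"[{event.get('type')}]\n")
        if (time.monotonic() - last_flush) * 1000 >= flush_ms:
            flush_tokens()
    flush_tokens()


read_fd, write_fd = os.pipe()
payload = b"".join(
    b"data: " + json.dumps({"type": "token", "content": c}).encode() + b"\n"
    for c in "abc"
) + b'data: {"type": "phase_end"}\n'
os.write(write_fd, payload)
os.close(write_fd)
writes = []
render(read_fd, writes.append)
os.close(read_fd)
```

Here three tokens and one block event result in two writes rather than four, which is the effect the flush cadence is meant to achieve under load.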

The renderer is embedded inline in demo-client.sh as a heredoc
(_PY_RENDERER); no separate file. ANSI color codes and event-type
formatting match the previous bash implementation exactly.

The bash render_event + _jq helpers are deleted (no longer used).
Most of stream_sse is gone too — replaced by a small wrapper that
launches curl in the background to capture its PID and feeds its
output to python via a FIFO.

KNOBS (env)
  STALL_SECS  default 10  — stream-idle threshold for the watchdog
  FLUSH_MS    default 50  — token-buffer flush cadence

VERIFIED LOCALLY (test harness against a python SSE server)
  Happy path: 50-token stream, clean close
    - Total wall: 1.04s (matches server emit time)
    - STREAM_RESULT=complete, LAST_EVENT_ID propagates correctly
  Stall path: 200 tokens, then server hangs (proxy-hang simulation)
    - Tokens render smoothly during emission
    - 5s after last token the watchdog warns and SIGTERMs curl
    - Bash pipeline exits in 9s total (was 24s before the kill-curl
      fix, would have been 25s+ in production until proxy timed out)
  All renderer output (run_start/phase_start/subcall_start/tokens/
  phase_end/run_complete/done) renders with proper formatting,
  timestamps, and colors.

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…md_steer)

The previous commit (python renderer) deleted render_event + _jq
together because both were used by the bash SSE consumer that python
replaced. But cmd_start and cmd_steer still call _jq to extract
invocation_id / session_id from the one-shot POST response — a small
helper, not part of the streaming hot path. Restored the helper with
an updated docstring that calls out its narrowed scope.

Symptom: 'demo-client.sh: line 367: _jq: command not found' on
./demo-client.sh start, followed by an empty INV_ID.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lf window

FALSE-POSITIVE OBSERVED
User reported: ./demo-client.sh start emitted research subcall 1/4
then triggered '⚠ stream stalled (no events for 10s)' even though no
crash occurred. Root cause: the hosted agent.yaml sets
INTRA_PHASE_COOLDOWN_SEC=30 and INTER_PHASE_COOLDOWN_SEC=30, so there
are legitimately ~30s silent periods between subcalls and between
phases (asyncio.sleep with no events emitted). A 10s watchdog
therefore mis-fires during normal operation.

FIX
1. Default STALL_SECS bumped 10 -> 60, comfortably above the longest
   planned silence (30s). Crash detection latency goes from 10s to
   ~60s in exchange for zero false positives during normal runs.
   Still better than the 20-30s baseline behavior the user saw before
   any watchdog at all.

2. Added a low-key hint when idle crosses HALF the stall window.
   Prints '...quiet for Ns (stall threshold 60s)' once every 10s,
   so the user sees the renderer is alive but quiet during cooldowns
   instead of wondering if it hung.

3. Hint counter resets every time data arrives, so back-to-back
   short cooldowns do not pile up hints.
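
A minimal sketch of the hint cadence in items 2 and 3. The `IdleHinter` class and its method names are hypothetical (the renderer's real structure is not shown in this commit); only the thresholds and the hint text come from the description above.

```python
class IdleHinter:
    """Decides when to print a quiet-period hint (hypothetical helper)."""

    def __init__(self, stall_secs: float = 60.0, hint_every: float = 10.0):
        self.stall_secs = stall_secs    # STALL_SECS default after this commit
        self.hint_every = hint_every    # at most one hint per 10s
        self.last_data_at = 0.0
        self.last_hint_at = None        # None means no hint since last data

    def on_data(self, now: float) -> None:
        # Any incoming data resets both the idle clock and the hint counter.
        self.last_data_at = now
        self.last_hint_at = None

    def poll(self, now: float):
        """Returns 'stalled', a hint string, or None if nothing should print."""
        idle = now - self.last_data_at
        if idle >= self.stall_secs:
            return "stalled"
        if idle < self.stall_secs / 2:
            return None
        if self.last_hint_at is not None and now - self.last_hint_at < self.hint_every:
            return None
        self.last_hint_at = now
        return f"...quiet for {int(idle)}s (stall threshold {int(self.stall_secs)}s)"
```

With the defaults, the first hint appears at 30s of silence, which matches the [00:30] line in the verification below.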

VERIFIED locally
  Server: emit run_start, then 40s silence, then run_complete + close
  Client: STALL_SECS=60
    [00:00] run_start banner
    [00:30] '...quiet for 30s (stall threshold 60s)'
    [00:40] run_complete renders, STREAM_RESULT=complete

Both knobs remain env-overridable (STALL_SECS, FLUSH_MS).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t SoT

User feedback: 'Why is the watchdog using a time-based idleness as
crash? Shouldnt we use the connection closure itself as the SOT?'

They are right. EOF on the curl pipe is the authoritative
crash/disconnect signal — TCP close happens when the server (or its
upstream proxy) terminates the SSE response. A time-based watchdog
duplicates that signal, mis-fires during legitimate quiet periods
(this demo has 30s cooldowns between subcalls and phases — see
INTRA_PHASE_COOLDOWN_SEC / INTER_PHASE_COOLDOWN_SEC in agent.yaml),
and forces every operator to tune cooldown-vs-detection-threshold.

REMOVED
- STALL_SECS env var and all its logic
- The 'half-window quiet hint' (only made sense alongside the watchdog)
- last_data_at and last_idle_hint state
- CURL_PID plumbing (no need to SIGTERM curl when there is no
  watchdog to force-close it)
- mkfifo / background-curl dance in stream_sse — now a plain pipe

KEPT
- FLUSH_MS token-buffer flush cadence (50ms) — still real and useful,
  it batches terminal writes so the renderer keeps pace with LLM emit
  rate.
- All ANSI formatting, event-type rendering, event_id passthrough.

EOF flow (the only disconnect path now)
  curl sees TCP close -> closes its stdout -> python's select() returns
  ready -> os.read returns b'' -> renderer flush_tokens + break out of
  while loop -> finally writes STATE_FILE -> bash sources state ->
  STREAM_RESULT=disconnected (or 'complete' if we saw run_complete /
  done first) -> _report_stream_result prints the right banner.
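
The EOF flow above can be sketched roughly as follows. This is not the renderer's actual code: `read_until_eof` and the `on_chunk` callback are hypothetical names, and terminal-event detection is simplified to a per-chunk check. It assumes a POSIX system, where `select()` works on pipe file descriptors.

```python
import os
import select


def read_until_eof(fd: int, on_chunk, poll_secs: float = 0.05) -> str:
    """Reads fd until EOF and returns 'complete' or 'disconnected'.

    on_chunk(chunk) should return True once a terminal event
    (run_complete / done) has been seen in the stream.
    """
    saw_terminal = False
    while True:
        ready, _, _ = select.select([fd], [], [], poll_secs)
        if not ready:
            # Timeout only: a real renderer would flush buffered tokens here
            # (the FLUSH_MS cadence). Silence is never treated as a crash.
            continue
        chunk = os.read(fd, 65536)
        if chunk == b"":
            # EOF: the writer (curl) closed its stdout after the TCP close.
            break
        if on_chunk(chunk):
            saw_terminal = True
    return "complete" if saw_terminal else "disconnected"
```

The only disconnect signal is the empty read; there is no idle timer to tune against the cooldown lengths.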

VERIFIED locally
  Happy path (clean close + run_complete):
    wall=1.05s, STREAM_RESULT=complete ✓
  Abrupt close (server emits 50 tokens then closes socket without
  emitting done):
    wall=1.04s (matches server timing exactly), STREAM_RESULT=disconnected,
    no false 'stalled' warning ✓

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RaviPidaparthi and others added 3 commits June 3, 2026 18:23
Two user-reported issues, both addressed at the agent layer (no
framework changes):

1) The 30s cooldowns between subcalls / phases made the terminal go
   silent — felt like nothing was happening.
2) Phase-level checkpointing meant the user had to wait ~5 min for
   the first phase to finish before crash testing was meaningful
   (else recovery just restarted phase 1 from scratch and the demo
   looked like nothing happened).

CHANGES

agent.py — subcall-level checkpoints
  - The handler now persists {in_progress_phase, completed_subcalls,
    current_text} on top of the prior {completed_phases, results}
    state. After each LLM subcall returns we flush to ctx.metadata.
  - On recovery (ctx.entry_mode == 'recovered'), if we crashed
    mid-phase we resume that same phase at the next unfinished
    subcall, reusing the text we had already produced.
  - Worst-case work lost on crash drops from ONE FULL PHASE (~3 min
    + 3 wasted LLM subcalls) to ONE SUBCALL (~30-60s + 1 LLM
    subcall). Crash testing is now meaningful at any point in the
    run, not just after a phase boundary.
  - Phase-complete checkpoint additionally clears the in-progress
    fields so the next phase starts cleanly.
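
A rough sketch of the subcall-level checkpoint and resume logic, under stated assumptions: a plain dict stands in for ctx.metadata, completed_phases is modelled as a count, and `run_phases`, `llm`, and `crash_at` are hypothetical names used only for illustration. The state keys are the ones listed above.

```python
def run_phases(metadata: dict, phases: int, subcalls: int, llm, crash_at=None) -> dict:
    """Runs phases x subcalls, checkpointing into `metadata` after each subcall."""
    metadata.setdefault("completed_phases", 0)  # assumption: stored as a count
    metadata.setdefault("results", [])
    for phase in range(metadata["completed_phases"] + 1, phases + 1):
        if metadata.get("in_progress_phase") == phase:
            # Recovered mid-phase: resume at the next unfinished subcall.
            done = metadata["completed_subcalls"]
            text = metadata["current_text"]
        else:
            done, text = 0, ""
        for sub in range(done + 1, subcalls + 1):
            if crash_at == (phase, sub):
                raise RuntimeError("simulated crash")
            text += llm(phase, sub)
            # Subcall-level checkpoint: at most one subcall is lost on crash.
            metadata.update(in_progress_phase=phase,
                            completed_subcalls=sub,
                            current_text=text)
        # Phase-complete checkpoint: record the result, clear in-progress fields.
        metadata["results"].append(text)
        metadata["completed_phases"] = phase
        for key in ("in_progress_phase", "completed_subcalls", "current_text"):
            metadata.pop(key, None)
    return metadata
```

Calling `run_phases` a second time with the same dict after a simulated crash skips every subcall that had already been checkpointed.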

agent.py — cooldown events
  - New _cooldown(ctx, duration, stage, phase, subcall=, of=) helper
    that emits a 'cooldown' SSE event before the asyncio sleep:
        {type:cooldown,duration_sec:30,stage:intra_phase,
         phase:2,total:15,subcall:3,of:4, ...}
  - Replaces the bare asyncio.wait_for in both the intra-phase
    (between subcalls) and inter-phase (between phases) cooldowns.
  - The wait stays cancel-aware (steering / operator cancel still
    short-circuit the cooldown).
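
One way a cancel-aware cooldown of this shape could look. This is a sketch, not the demo's helper: the `cooldown` signature, the `emit` callback, and the `cancel` event are assumptions, and only the event fields mirror the example payload above.

```python
import asyncio


async def cooldown(emit, cancel: asyncio.Event, duration: float, stage: str, **fields) -> bool:
    """Emits a 'cooldown' event, then waits up to `duration` seconds.

    Returns True if the wait was cut short because `cancel` was set
    (e.g. steering or operator cancel), False if the full cooldown elapsed.
    """
    await emit({"type": "cooldown", "duration_sec": duration, "stage": stage, **fields})
    try:
        # Waiting on the event instead of sleeping keeps the cooldown interruptible.
        await asyncio.wait_for(cancel.wait(), timeout=duration)
        return True
    except asyncio.TimeoutError:
        return False
```

Emitting before the wait is what keeps the client informed: the event reaches the terminal at the start of the silent period, not after it.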

demo-client.sh — cooldown renderer
  - Added a 'cooldown' case to the python renderer that prints a
    single dim line, e.g.
       [18:00:42Z]   ...cooling down 30s (between subcalls) — next: subcall 3/4 in phase 2/15
  - One line per cooldown, no spam.

README — updated the 'what the agent does' blurb to reflect:
  - Checkpoints are now per-subcall (not per-phase).
  - Cooldowns emit visible SSE events.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… heredoc)

Symptom (user-reported):
  Traceback ... NameError: name 'duration_sec' is not defined

Root cause: my previous commit added the cooldown event renderer with
a Python string literal using single quotes:
    evt.get('duration_sec', 0)
The single quotes prematurely terminated the surrounding single-quoted
bash string that holds the renderer (_PY_RENDERER=apostrophe...apostrophe,
the "heredoc"), so bash quote concatenation silently stripped them from
the runtime python source. Python then saw evt.get(duration_sec, 0)
with a bare name, which surfaced as a NameError on duration_sec.
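
The quoting failure can be reproduced without bash. The sketch below uses Python's shlex in POSIX mode as an approximation of shell quote handling; the one-line renderer bodies and the `bash_single_quoted_value` helper are made up for illustration.

```python
import shlex


def bash_single_quoted_value(assignment: str) -> str:
    """Returns roughly what a POSIX shell would store for NAME='...'."""
    words = shlex.split(assignment, posix=True)
    _name, _, value = words[0].partition("=")
    return value


# Inner single quotes end and restart the outer string, so they vanish.
broken = "_PY_RENDERER='n = evt.get('duration_sec', 0)'"
# Double quotes are literal inside a single-quoted string, so they survive.
safe = "_PY_RENDERER='_DSEC = \"duration_sec\"; n = evt.get(_DSEC, 0)'"

print(bash_single_quoted_value(broken))
print(bash_single_quoted_value(safe))
```

The first value comes out as `n = evt.get(duration_sec, 0)`, which is valid Python syntax but refers to an undefined name, so the failure appears only at run time.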

Fix
- Alias the dict key as a module-level constant _DSEC = "duration_sec"
  (double quotes, which are safe here). Use evt.get(_DSEC, ...) at the
  call site.
- Add a CRITICAL header comment explaining the gotcha so future edits
  do not reintroduce apostrophes. The header itself is reworded to
  avoid using the literal character.
- Reword the inline NOTE comment for the same reason.

Verified
- bash -n parses
- python ast.parse on the extracted heredoc parses
- Functional smoke: phase_end and cooldown events render correctly,
  duration_sec extracts and formats as expected.

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ints + cooldown events)

Captures the v31 deploy that ships the subcall-level checkpointing
and cooldown-event emission from commit 2925f1d.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>