fix(ecs): write task payload to S3 instead of 8 KB containerOverrides (#502) #503

Draft
isadeks wants to merge 1 commit into main from fix/502-ecs-payload-s3-pointer

@isadeks isadeks commented Jun 30, 2026

Summary

Fixes #502 — the ECS/Fargate compute strategy rejected every real task with:

InvalidParameterException: Container Overrides length must be at most 8192

Root cause: ecs-strategy.ts inlined the entire orchestrator payload (incl. the large hydrated_context) into the AGENT_PAYLOAD container-override env var. ECS RunTask caps the total containerOverrides blob at 8192 bytes, so the call failed before the container started. AgentCore is unaffected — it passes the payload in the InvokeAgentRuntime request body, which has no comparable limit. The earlier ECS smoke test (#494, a small Rust cargo check) had a payload that fit under 8 KB, so it didn't surface this.
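To make the failure mode concrete, here is a minimal sketch of a pre-flight size check. The interface and function names are hypothetical (they are not taken from ecs-strategy.ts), and measuring the length of the serialized JSON string is an assumption about how ECS counts the limit; only the 8192 figure comes from the error message above.

```typescript
// Hypothetical sketch: none of these names come from ecs-strategy.ts.
// Only the 8192 figure is taken from the RunTask error message.
const CONTAINER_OVERRIDES_LIMIT = 8192;

interface EnvVar {
  name: string;
  value: string;
}

interface ContainerOverride {
  name: string;
  environment: EnvVar[];
}

// Approximate the size ECS sees by serializing the overrides structure.
// Assumption: the limit applies to the JSON-formatted overrides as a whole.
function overridesSize(overrides: ContainerOverride[]): number {
  return JSON.stringify({ containerOverrides: overrides }).length;
}

function fitsInline(overrides: ContainerOverride[]): boolean {
  return overridesSize(overrides) <= CONTAINER_OVERRIDES_LIMIT;
}

// A smoke-test-sized payload versus one carrying a large hydrated_context.
const smallOverrides: ContainerOverride[] = [
  {
    name: "agent",
    environment: [
      { name: "AGENT_PAYLOAD", value: JSON.stringify({ task_id: "t-1" }) },
    ],
  },
];

const largeOverrides: ContainerOverride[] = [
  {
    name: "agent",
    environment: [
      {
        name: "AGENT_PAYLOAD",
        value: JSON.stringify({
          task_id: "t-1",
          hydrated_context: "x".repeat(9000),
        }),
      },
    ],
  },
];
```

Note that every other override field counts toward the same total, so the headroom left for an inline payload is smaller than 8192.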

Fix — pass a pointer, not the payload

| Piece | Change |
| --- | --- |
| EcsPayloadBucket (new) | Dedicated S3 bucket, mirrors TraceArtifactsBucket: BLOCK_ALL + enforceSSL + S3_MANAGED, 1-day lifecycle TTL. Dedicated (not co-tenant with attachments/traces) so the task role's read is scoped to payloads only. |
| ecs-strategy | PutObject payload → <task_id>/payload.json; pass AGENT_PAYLOAD_S3_URI in the override; boot command fetches via boto3. Inline AGENT_PAYLOAD kept as fallback (small payloads / no bucket), so no regression. deleteEcsPayload helper. |
| orchestrate-task finalize | Best-effort deleteEcsPayload for ECS tasks on terminal; the 1-day TTL is the crash backstop. |
| EcsAgentCluster | Accept payloadBucket, inject ECS_PAYLOAD_BUCKET, grant task role read-only (untrusted repo code must not write/delete; the trusted orchestrator owns write+delete). Session-role-aware. |
| task-orchestrator | ecsPayloadBucket prop → grantPut + grantDelete; @aws-sdk/client-s3 added to bundling externals. |
| agent.ts | Updated the commented uncomment-to-enable ECS scaffolding to wire the payload bucket. |
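The pointer-versus-inline decision described for ecs-strategy can be sketched roughly as follows. This is an illustration, not the PR's code: planPayloadDelivery and PayloadPlan are hypothetical names, the s3://bucket/key URI shape is an assumption, and the actual upload (a PutObject call in the PR) is left to the caller so the sketch stays free of SDK dependencies.

```typescript
// Hypothetical sketch. AGENT_PAYLOAD, AGENT_PAYLOAD_S3_URI and the
// <task_id>/payload.json key layout come from the PR description;
// everything else is illustrative.
interface EnvVar {
  name: string;
  value: string;
}

interface PayloadPlan {
  // Environment entries to place in the container override.
  environment: EnvVar[];
  // Present only when the payload should be written to S3 first.
  upload?: { bucket: string; key: string; body: string };
}

function planPayloadDelivery(
  taskId: string,
  payload: unknown,
  bucket?: string,
): PayloadPlan {
  const body = JSON.stringify(payload);

  // No bucket configured: keep the old inline behaviour as a fallback.
  if (!bucket) {
    return { environment: [{ name: "AGENT_PAYLOAD", value: body }] };
  }

  // Bucket configured: upload out-of-band and pass only a pointer.
  const key = `${taskId}/payload.json`;
  return {
    // Assumption: the pointer is an s3://bucket/key style URI.
    environment: [
      { name: "AGENT_PAYLOAD_S3_URI", value: `s3://${bucket}/${key}` },
    ],
    upload: { bucket, key, body },
  };
}
```

With a bucket, the override carries a short URI whose size does not grow with hydrated_context; without one, behaviour matches the pre-fix inline path.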

Security stance

  • ECS task role: read-only on the payload bucket (the container runs untrusted repo code).
  • Orchestrator: write + delete (trusted).
  • Bucket is private (BLOCK_ALL), TLS-enforced, encrypted; payloads expire in 1 day.
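The split between the two roles can be expressed as a simple check over policy statements. The sketch below uses a simplified, hypothetical Statement shape rather than real IAM or CDK types, and the specific action names are assumptions chosen to match the stance above (the actual grants come from the CDK grant helpers named in the PR).

```typescript
// Hypothetical, simplified policy model; not the IAM or CDK types.
interface Statement {
  effect: "Allow" | "Deny";
  actions: string[];
}

const WRITE_ACTION_PREFIXES = ["s3:Put", "s3:Delete"];

// True if any Allow statement grants an S3 write or delete action.
// Simplification: only "*", "s3:*" and prefix matches are recognised.
function allowsWrite(statements: Statement[]): boolean {
  return statements.some(
    (statement) =>
      statement.effect === "Allow" &&
      statement.actions.some(
        (action) =>
          action === "*" ||
          action === "s3:*" ||
          WRITE_ACTION_PREFIXES.some((prefix) => action.startsWith(prefix)),
      ),
  );
}

// Assumed action lists illustrating the intended split.
const taskRoleStatements: Statement[] = [
  { effect: "Allow", actions: ["s3:GetObject"] },
];

const orchestratorStatements: Statement[] = [
  { effect: "Allow", actions: ["s3:PutObject", "s3:DeleteObject"] },
];
```

Under these assumed grants, allowsWrite is false for the task role and true for the orchestrator, which is the property the security stance relies on.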

Testing

  • New ecs-payload-bucket.test.ts: TTL=1d, SSL-only, block-public, autoDelete.
  • ecs-strategy.test.ts: S3 write + AGENT_PAYLOAD_S3_URI pointer (no inline blob); inline fallback when no bucket; boot command fetch-from-S3-with-fallback; deleteEcsPayload delete + best-effort swallow + no-op without bucket.
  • ecs-agent-cluster.test.ts: ECS_PAYLOAD_BUCKET env, task role read-only (asserts no s3:Put*/s3:Delete*), omitted when no bucket.
  • Full mise run build green (cdk tests + agent + cli + docs + synth + lint).
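The three deleteEcsPayload behaviours listed in the tests (delete, best-effort swallow, no-op without a bucket) might look roughly like this. It is a sketch under stated assumptions: deletePayloadBestEffort and the injected DeleteObject function are hypothetical stand-ins, used here so the example runs without the AWS SDK.

```typescript
// Hypothetical sketch; the real helper is deleteEcsPayload in ecs-strategy.
// The S3 call is injected so the example has no SDK dependency.
type DeleteObject = (bucket: string, key: string) => Promise<void>;

// Returns true only when the delete call completed without error.
async function deletePayloadBestEffort(
  taskId: string,
  bucket: string | undefined,
  deleteObject: DeleteObject,
): Promise<boolean> {
  // No bucket configured: the payload was passed inline, nothing to delete.
  if (!bucket) {
    return false;
  }
  try {
    await deleteObject(bucket, `${taskId}/payload.json`);
    return true;
  } catch {
    // Best effort: swallow the error so finalize is never blocked.
    // The bucket's 1-day lifecycle rule removes anything left behind.
    return false;
  }
}
```

Swallowing the error is a deliberate trade-off here: a failed cleanup costs at most one day of storage for a small object, whereas a thrown error could interfere with task finalization.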

Live verification (dev, ECS wired)

Deployed to a dev stack with --context compute_type=ecs and fired a real fork task:

  • compute_type=ecs, session_id = a real ECS task ARN → RunTask succeeded (the prior tasks died here with InvalidParameterException).
  • Payload object written to S3: payload.json = 8455 bytes, above the 8192-byte containerOverrides cap, i.e. exactly the kind of payload that would have failed inline.
  • Container log: Using hydrated context from orchestrator, then cloned/branched/ran the build — the agent received and parsed the full payload via the S3 pointer.
  • Payload bucket live with the 1-day TTL; orchestrator has s3:PutObject/DeleteObject; ECS task role read-only.
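On the container side, the boot command has to decide where the payload comes from. The PR does this in Python via boto3; the TypeScript sketch below only mirrors the assumed precedence (S3 pointer first, inline value as fallback). resolvePayloadSource and parseS3Uri are hypothetical names, and the s3://bucket/key URI shape is an assumption.

```typescript
// Hypothetical sketch of the boot-time decision; the PR's boot command
// is Python/boto3, and this only illustrates the assumed precedence.
type PayloadSource =
  | { kind: "s3"; bucket: string; key: string }
  | { kind: "inline"; json: string };

// Split an s3://bucket/key URI into its parts (assumed URI shape).
function parseS3Uri(uri: string): { bucket: string; key: string } {
  const prefix = "s3://";
  if (uri.slice(0, prefix.length) !== prefix) {
    throw new Error(`Not an S3 URI: ${uri}`);
  }
  const rest = uri.slice(prefix.length);
  const slash = rest.indexOf("/");
  if (slash <= 0 || slash === rest.length - 1) {
    throw new Error(`S3 URI is missing a bucket or key: ${uri}`);
  }
  return { bucket: rest.slice(0, slash), key: rest.slice(slash + 1) };
}

function resolvePayloadSource(
  env: Record<string, string | undefined>,
): PayloadSource {
  // Prefer the pointer when the orchestrator provided one.
  const uri = env["AGENT_PAYLOAD_S3_URI"];
  if (uri) {
    const { bucket, key } = parseS3Uri(uri);
    return { kind: "s3", bucket, key };
  }
  // Otherwise fall back to the inline payload (small payloads / no bucket).
  const inline = env["AGENT_PAYLOAD"];
  if (inline !== undefined) {
    return { kind: "inline", json: inline };
  }
  throw new Error("Neither AGENT_PAYLOAD_S3_URI nor AGENT_PAYLOAD is set");
}
```

Whether the real boot command also falls back to the inline value when the S3 fetch itself fails is not stated in the PR, so the sketch does not model that case.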

Notes / scope

Closes #502

🤖 Generated with Claude Code

The ECS compute strategy inlined the full orchestrator payload (incl. the
large hydrated_context) into the AGENT_PAYLOAD container-override env var.
ECS RunTask caps the TOTAL containerOverrides blob at 8192 bytes, so any real
task was rejected before the container started:

  InvalidParameterException: Container Overrides length must be at most 8192

AgentCore is unaffected — it passes the payload in the InvokeAgentRuntime
request body, which has no comparable limit. The bug only surfaces with a
realistic hydrated payload, which is why the prior ECS smoke test (a small
Rust cargo-check, #494) didn't catch it.

Fix — stash the payload out-of-band and pass only a pointer:
- New EcsPayloadBucket construct (mirrors TraceArtifactsBucket): BLOCK_ALL,
  enforceSSL, S3_MANAGED encryption, 1-day lifecycle TTL (payloads are
  ephemeral — read once at boot). Dedicated bucket so the ECS task role's S3
  read is scoped to payloads only and can't touch attachments/traces.
- ecs-strategy: when ECS_PAYLOAD_BUCKET is set, PutObject the payload to
  <task_id>/payload.json and pass AGENT_PAYLOAD_S3_URI in the override; the
  boot command fetches+parses it via boto3. Inline AGENT_PAYLOAD remains as a
  fallback (small payloads / no bucket), so nothing regresses. deleteEcsPayload
  helper removes the object.
- orchestrate-task finalize: best-effort deleteEcsPayload for ECS tasks once
  terminal (the container has long since read it); lifecycle rule is the
  crash backstop.
- EcsAgentCluster: accept payloadBucket, inject ECS_PAYLOAD_BUCKET env, grant
  the task role READ ONLY (untrusted repo code must not write/delete payloads;
  the trusted orchestrator owns write+delete). Session-role-aware.
- task-orchestrator: ecsPayloadBucket prop → grantPut + grantDelete to the
  orchestrator; @aws-sdk/client-s3 added to bundling externals.
- agent.ts: updated the commented uncomment-to-enable ECS scaffolding to wire
  the payload bucket.

Tests: new bucket construct (TTL/SSL/block-public/autoDelete); strategy
S3-write + URI-pointer + inline fallback + deleteEcsPayload (incl. best-effort
swallow + no-op without bucket); cluster read-grant + env var + read-only
(no put/delete). Full build green.

Closes #502
@isadeks isadeks requested review from a team as code owners June 30, 2026 00:10
@isadeks isadeks marked this pull request as draft June 30, 2026 00:11
Development

Successfully merging this pull request may close these issues.

ECS compute strategy: RunTask rejected — payload inlined into 8 KB containerOverrides limit
