fix(ecs): write task payload to S3 instead of 8 KB containerOverrides (#502) #503
Draft
isadeks wants to merge 1 commit into
The ECS compute strategy inlined the full orchestrator payload (incl. the large hydrated_context) into the AGENT_PAYLOAD container-override env var. ECS RunTask caps the TOTAL containerOverrides blob at 8192 bytes, so any real task was rejected before the container started:

    InvalidParameterException: Container Overrides length must be at most 8192

AgentCore is unaffected: it passes the payload in the InvokeAgentRuntime request body, which has no comparable limit. The bug only surfaces with a realistic hydrated payload, which is why the prior ECS smoke test (a small Rust cargo-check, #494) didn't catch it.

Fix: stash the payload out-of-band and pass only a pointer:

- New EcsPayloadBucket construct (mirrors TraceArtifactsBucket): BLOCK_ALL, enforceSSL, S3_MANAGED encryption, 1-day lifecycle TTL (payloads are ephemeral, read once at boot). Dedicated bucket so the ECS task role's S3 read is scoped to payloads only and can't touch attachments/traces.
- ecs-strategy: when ECS_PAYLOAD_BUCKET is set, PutObject the payload to <task_id>/payload.json and pass AGENT_PAYLOAD_S3_URI in the override; the boot command fetches and parses it via boto3. Inline AGENT_PAYLOAD remains as a fallback (small payloads / no bucket), so nothing regresses. deleteEcsPayload helper removes the object.
- orchestrate-task finalize: best-effort deleteEcsPayload for ECS tasks once terminal (the container has long since read it); the lifecycle rule is the crash backstop.
- EcsAgentCluster: accept payloadBucket, inject ECS_PAYLOAD_BUCKET env, grant the task role READ ONLY (untrusted repo code must not write/delete payloads; the trusted orchestrator owns write+delete). Session-role-aware.
- task-orchestrator: ecsPayloadBucket prop → grantPut + grantDelete to the orchestrator; @aws-sdk/client-s3 added to bundling externals.
- agent.ts: updated the commented uncomment-to-enable ECS scaffolding to wire the payload bucket.

Tests: new bucket construct (TTL/SSL/block-public/autoDelete); strategy S3-write + URI-pointer + inline fallback + deleteEcsPayload (incl. best-effort swallow + no-op without bucket); cluster read-grant + env var + read-only (no put/delete). Full build green.

Closes #502
Summary
Fixes #502: the ECS/Fargate compute strategy rejected every real task with `InvalidParameterException: Container Overrides length must be at most 8192`.
Root cause: `ecs-strategy.ts` inlined the entire orchestrator payload (incl. the large `hydrated_context`) into the `AGENT_PAYLOAD` container-override env var. ECS `RunTask` caps the total `containerOverrides` blob at 8192 bytes, so the call failed before the container started. AgentCore is unaffected: it passes the payload in the `InvokeAgentRuntime` request body, which has no comparable limit. The earlier ECS smoke test (#494, a small Rust `cargo check`) had a payload that fit under 8 KB, so it didn't surface this.

Fix: pass a pointer, not the payload
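As a rough illustration of the pointer handoff, the sketch below plans the delivery on the orchestrator side and resolves it on the container side. Every function and type name here (`planPayloadDelivery`, `fitsInOverrides`, `resolvePayloadSource`, and the interfaces) is hypothetical and not taken from `ecs-strategy.ts`; only the env var names, the `<task_id>/payload.json` key layout, and the 8192 cap come from this PR. The size check approximates the overrides blob as its serialized JSON, which is an assumption about how ECS measures it, and the PR's real boot command does the container-side step in Python via boto3.

```typescript
// Illustrative sketch only: names below are hypothetical, not the PR's code.
const CONTAINER_OVERRIDES_LIMIT = 8192; // cap quoted in this PR

interface EnvVar {
  name: string;
  value: string;
}

interface DeliveryPlan {
  environment: EnvVar[];
  // Set only when the payload goes to S3; the caller performs the PutObject.
  upload?: { bucket: string; key: string; body: string };
}

// Orchestrator side: S3 pointer when a bucket is configured,
// inline env var otherwise (the fallback path).
function planPayloadDelivery(taskId: string, payload: unknown, bucket?: string): DeliveryPlan {
  const body = JSON.stringify(payload);
  if (bucket) {
    const key = `${taskId}/payload.json`;
    return {
      environment: [{ name: "AGENT_PAYLOAD_S3_URI", value: `s3://${bucket}/${key}` }],
      upload: { bucket: bucket, key: key, body: body },
    };
  }
  return { environment: [{ name: "AGENT_PAYLOAD", value: body }] };
}

// Assumption: approximate the overrides blob as the UTF-8 size of its JSON form.
function fitsInOverrides(environment: EnvVar[]): boolean {
  const blob = JSON.stringify({ containerOverrides: [{ environment: environment }] });
  return new TextEncoder().encode(blob).length <= CONTAINER_OVERRIDES_LIMIT;
}

type PayloadSource =
  | { kind: "s3"; bucket: string; key: string }
  | { kind: "inline"; payload: unknown };

// Container side: prefer the S3 pointer, fall back to the inline payload.
function resolvePayloadSource(env: { [name: string]: string | undefined }): PayloadSource {
  const uri = env["AGENT_PAYLOAD_S3_URI"];
  if (uri) {
    const match = /^s3:\/\/([^\/]+)\/(.+)$/.exec(uri);
    if (!match) {
      throw new Error("Malformed AGENT_PAYLOAD_S3_URI: " + uri);
    }
    return { kind: "s3", bucket: match[1], key: match[2] };
  }
  const inline = env["AGENT_PAYLOAD"];
  if (inline === undefined) {
    throw new Error("Neither AGENT_PAYLOAD_S3_URI nor AGENT_PAYLOAD is set");
  }
  return { kind: "inline", payload: JSON.parse(inline) };
}
```

Keeping the planning step free of I/O means the override size can be checked before `RunTask` is called, and the pointer stays a short fixed-shape string regardless of how large the hydrated payload grows.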
| Component | Change |
| --- | --- |
| `EcsPayloadBucket` (new) | Mirrors `TraceArtifactsBucket`: `BLOCK_ALL` + `enforceSSL` + `S3_MANAGED`, 1-day lifecycle TTL. Dedicated (not co-tenant with attachments/traces) so the task role's read is scoped to payloads only. |
| `ecs-strategy` | `PutObject` payload → `<task_id>/payload.json`; pass `AGENT_PAYLOAD_S3_URI` in the override; boot command fetches via boto3. Inline `AGENT_PAYLOAD` kept as fallback (small payloads / no bucket), so no regression. `deleteEcsPayload` helper. |
| `orchestrate-task` finalize | Best-effort `deleteEcsPayload` for ECS tasks on terminal; the 1-day TTL is the crash backstop. |
| `EcsAgentCluster` | Accept `payloadBucket`, inject `ECS_PAYLOAD_BUCKET`, grant task role read-only (untrusted repo code must not write/delete; the trusted orchestrator owns write+delete). Session-role-aware. |
| `task-orchestrator` | `ecsPayloadBucket` prop → `grantPut` + `grantDelete`; `@aws-sdk/client-s3` added to bundling externals. |
| `agent.ts` | Updated the commented uncomment-to-enable ECS scaffolding to wire the payload bucket. |

Security stance
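The read/write split described in this PR (task role read-only, orchestrator owns put and delete) can be pictured with a small check like the one below. This is a sketch under stated assumptions: the statement shapes and the exact action names are hypothetical stand-ins, not the policy that the CDK grants actually synthesize, and `allowsMutation` is an illustrative helper in the spirit of the cluster test's "no `s3:Put*`/`s3:Delete*`" assertion.

```typescript
// Illustrative sketch only: hypothetical statements, not the synthesized CDK policy.
interface Statement {
  effect: "Allow" | "Deny";
  actions: string[];
  resources: string[];
}

// Assumed shape of the ECS task role's access: read objects only.
function taskRoleStatements(bucketArn: string): Statement[] {
  return [{ effect: "Allow", actions: ["s3:GetObject"], resources: [bucketArn + "/*"] }];
}

// Assumed shape of the orchestrator's access: write and delete objects.
function orchestratorStatements(bucketArn: string): Statement[] {
  return [
    { effect: "Allow", actions: ["s3:PutObject", "s3:DeleteObject"], resources: [bucketArn + "/*"] },
  ];
}

// True if any Allow statement grants an S3 put or delete action,
// including wildcards that would cover them.
function allowsMutation(statements: Statement[]): boolean {
  return statements.some(function (s) {
    return (
      s.effect === "Allow" &&
      s.actions.some(function (a) {
        return a === "*" || a === "s3:*" || a.indexOf("s3:Put") === 0 || a.indexOf("s3:Delete") === 0;
      })
    );
  });
}
```

A check of this kind fails closed on wildcards, so a later change that widens the task role to `s3:*` would be flagged rather than silently accepted.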
Testing
- `ecs-payload-bucket.test.ts`: TTL=1d, SSL-only, block-public, autoDelete.
- `ecs-strategy.test.ts`: S3 write + `AGENT_PAYLOAD_S3_URI` pointer (no inline blob); inline fallback when no bucket; boot command fetch-from-S3-with-fallback; `deleteEcsPayload` delete + best-effort swallow + no-op without bucket.
- `ecs-agent-cluster.test.ts`: `ECS_PAYLOAD_BUCKET` env, task role read-only (asserts no `s3:Put*`/`s3:Delete*`), omitted when no bucket.
- `mise run build` green (cdk tests + agent + cli + docs + synth + lint).

Live verification (dev, ECS wired)
Deployed to a dev stack with `--context compute_type=ecs` and fired a real fork task:

- `compute_type=ecs`, `session_id` = a real ECS task ARN → `RunTask` succeeded (the prior tasks died here with `InvalidParameterException`).
- `payload.json` = 8455 bytes, above the 8192-byte `containerOverrides` cap, i.e. exactly the payload that would have failed inline.
- `Using hydrated context from orchestrator`, then cloned/branched/ran the build: the agent received and parsed the full payload via the S3 pointer.
- Orchestrator: `s3:PutObject`/`DeleteObject`; ECS task role read-only.

Notes / scope
`main` (the `agent.ts` wiring is the existing commented uncomment-to-enable scaffolding). This PR fixes the latent strategy bug + plumbing; it does not flip ECS on. Complementary to "fix: two bugs that prevent ECS Fargate from working" (#494).

Closes #502
🤖 Generated with Claude Code