feat: Bedrock cost attribution — session tags, request metadata, and operator FinOps guidance (#215) by krokoko · Pull Request #521 · aws-samples/sample-autonomous-cloud-coding-agents

krokoko · 2026-06-30T22:01:06Z

Summary

Implements Bedrock cost attribution (#215): attribute Bedrock model-inference spend per user and repo, complementing the in-app cost_usd meter and #211 per-session tenant isolation.

The key architectural fact driving the design: Bedrock is invoked by the Claude Code CLI subprocess (CLAUDE_CODE_USE_BEDROCK=1), not by the agent's boto3. So attribution cannot be wired through aws_session.py (which scopes DynamoDB/S3 tenant data) — both levers live in Claude Code's own configuration, set by the agent before it spawns the subprocess.

| Track | Mechanism | Surfaces in |
| --- | --- | --- |
| 1 — IAM session-tag chargeback | `bedrock:InvokeModel*` granted to the existing `AgentSessionRole`; Claude Code's `awsCredentialExport` runs `bedrock_creds_helper.py`, which assumes that role with `{user_id, repo, task_id}` STS session tags | AWS Cost Explorer / CUR 2.0 (`iamPrincipal/` prefix), aggregated |
| 2 — per-call forensics | `X-Amzn-Bedrock-Request-Metadata` set via `ANTHROPIC_CUSTOM_HEADERS` on the subprocess env | Bedrock model-invocation logs (`requestMetadata` field), per call |
| 3 — operator guide | `docs/guides/COST_ATTRIBUTION.md` + cross-links | — |

Tracks 1 and 2 are complementary (per AWS docs): session tags give aggregated billing chargeback (they are not written to invocation logs); request metadata gives per-call detail in logs (it is not a cost-allocation tag). You need both.

What was implemented

Track 1 — session-tagged credentials

cdk/src/constructs/agent-session-role.ts — new optional invokableModels prop. Each invokable is grantInvoke-ed to the SessionRole — the same grant the compute role receives, reused (not hand-rolled ARNs) so cross-region inference profiles fan out to every routed region and can't AccessDenied. No aws:PrincipalTag condition (the tags are for billing, not access scoping). Scoped to explicit model/profile ARNs — never Resource:'*'.
cdk/src/stacks/agent.ts — builds the invokable set in one loop (rebased onto #434's resolveBedrockModelIds; #434 is "feat(cdk): single source of truth for invocable Bedrock models, context-overridable (#433)"), and grants both the runtime and the SessionRole from the same list so the two grants can't drift.
agent/src/bedrock_creds_helper.py (new) — invoked by awsCredentialExport. Reads a 0600 file (SessionRole ARN + tags), assumes the role, emits {"Credentials":{…,"Expiration":…}}. The real Expiration drives Claude Code's pre-expiry refresh, beating the 1 h role-chaining cap on long tasks. Fails open to ambient compute-role credentials (this is a billing/observability control, not tenant isolation — contrast aws_session.py's fail-closed path).
agent/managed-settings.json + Dockerfile — awsCredentialExport lives in root-owned /etc/claude-code/managed-settings.json (copied before USER agent). Highest-precedence settings tier, loaded regardless of setting_sources=["project"], so the untrusted cloned repo cannot override it (it runs an arbitrary command — putting it anywhere the repo can influence would be RCE with the compute role).
agent/src/aws_session.py — extracted build_session_tags() so the in-process tenant path and the out-of-process Bedrock helper mint identical tags from one definition.
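
The credential-helper flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in bedrock_creds_helper.py: the config keys (`role_arn`, `tags`), the function names, and the choice to print nothing on failure are all assumptions made for the example. Only the `abca-bedrock-<task_id>` session name, the three tag keys, and the `{"Credentials": {…, "Expiration": …}}` output shape come from this PR. The STS client is injected so the sketch can be exercised without AWS access.

```python
import json
import sys
from datetime import datetime, timezone
from pathlib import Path


def load_attribution(path):
    """Return the parsed attribution config, or None if it is missing or unreadable."""
    try:
        return json.loads(Path(path).read_text())
    except (OSError, ValueError):
        return None


def export_credentials(config, sts_client):
    """Assume the session role with STS session tags and return the JSON to print."""
    # Hypothetical config layout: {"role_arn": "...", "tags": {"user_id": ..., ...}}
    tags = [{"Key": k, "Value": str(v)} for k, v in sorted(config["tags"].items())]
    resp = sts_client.assume_role(
        RoleArn=config["role_arn"],
        RoleSessionName=f"abca-bedrock-{config['tags']['task_id']}",
        Tags=tags,
    )
    creds = resp["Credentials"]
    expiration = creds["Expiration"]
    if isinstance(expiration, datetime):  # boto3 returns a timezone-aware datetime
        expiration = expiration.astimezone(timezone.utc).isoformat()
    return json.dumps({
        "Credentials": {
            "AccessKeyId": creds["AccessKeyId"],
            "SecretAccessKey": creds["SecretAccessKey"],
            "SessionToken": creds["SessionToken"],
            "Expiration": expiration,
        }
    })


def run(path, make_sts_client):
    """Print credentials JSON on success; print nothing (fail open) otherwise."""
    config = load_attribution(path)
    if config is None:
        return
    try:
        print(export_credentials(config, make_sts_client()))
    except Exception as exc:  # deliberately broad: a billing control, not isolation
        print(f"bedrock creds helper: using ambient credentials: {exc!r}", file=sys.stderr)
```

In a real helper the factory would presumably be something like `lambda: boto3.client("sts")`, imported lazily so that a missing boto3 also lands in the fail-open branch.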

Track 2 — per-request metadata

agent/src/runner.py — _setup_bedrock_cost_attribution() writes the attribution file and sets ANTHROPIC_CUSTOM_HEADERS on the process env (deliberate, documented exception to the "tenant ids out of os.environ" rule — the values are self-referential and non-secret; json.dumps escaping blocks header injection).
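
As a rough sketch of the two things `_setup_bedrock_cost_attribution()` is described as doing, the snippet below builds the header value and writes the attribution file with mode 0600. The function names and the file layout are hypothetical. The sketch assumes that `ANTHROPIC_CUSTOM_HEADERS` takes `Name: Value` text and that the metadata header carries a JSON object. It relies on `json.dumps` escaping control characters, which is the property the PR cites as blocking header injection.

```python
import json
import os
import tempfile

METADATA_HEADER = "X-Amzn-Bedrock-Request-Metadata"


def build_custom_headers(user_id, repo, task_id):
    """Return a 'Name: Value' string suitable for ANTHROPIC_CUSTOM_HEADERS."""
    metadata = {"user_id": user_id, "repo": repo, "task_id": task_id}
    # json.dumps escapes CR/LF and other control characters inside the values,
    # so a hostile value cannot terminate the header line and start a new one.
    return f"{METADATA_HEADER}: {json.dumps(metadata, separators=(',', ':'))}"


def write_attribution_file(path, role_arn, tags):
    """Atomically write the helper's config file, readable only by the owner."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the temporary file with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump({"role_arn": role_arn, "tags": tags}, handle)
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A caller would then set `os.environ["ANTHROPIC_CUSTOM_HEADERS"] = build_custom_headers(...)` before spawning the subprocess. Because one container runs one task, a static per-process value is also a per-task value.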

Track 2 prerequisite fix — model-invocation logging now actually enables on deploy

Found during live verification: the ModelInvocationLogging custom resource sent largeDataDeliveryS3Config with an empty bucketName, which Bedrock rejects client-side (ValidationException, min length: 3). With ignoreErrorCodesMatching: '.*' swallowing it and onUpdate never re-firing (static props), a fresh deploy silently left model-invocation logging disabled — so Bedrock recorded no requestMetadata and Track 2 produced nothing to query.

Omit largeDataDeliveryS3Config entirely (optional; only for S3 large-data delivery, unused here).
Narrow ignoreErrorCodesMatching from .* to transient service errors (Throttling/ServiceUnavailable/InternalServer) so a client-side misconfiguration fails the deploy loudly instead of disabling logging silently.
Grant iam:PassRole on BedrockLoggingRole to the custom resource's role. PutModelInvocationLoggingConfiguration hands that role to the Bedrock service (to write the log group), so the caller needs PassRole on it. This was a second latent bug also masked by the .* ignore — narrowing the ignore made it surface and fail the deploy (as intended); fixed here, scoped to the one role ARN (not a wildcard).
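
The effect of narrowing the error-ignore pattern can be illustrated in a few lines. The regex below is a hypothetical stand-in for whatever the stack actually uses, and `should_ignore` only mimics the idea of matching an error code against that pattern. The config builder shows a logging configuration with `largeDataDeliveryS3Config` simply left out; the data-delivery flags are also omitted from this sketch.

```python
import re

# Hypothetical narrowed pattern: transient service errors only.
TRANSIENT_ERROR_PATTERN = r"^(Throttling|ServiceUnavailable|InternalServer)"


def should_ignore(error_code, pattern=TRANSIENT_ERROR_PATTERN):
    """Return True if an error with this code would be swallowed."""
    return re.search(pattern, error_code) is not None


def build_logging_config(log_group_name, role_arn):
    """Build a CloudWatch-only logging config without largeDataDeliveryS3Config."""
    return {
        "cloudWatchConfig": {
            "logGroupName": log_group_name,
            "roleArn": role_arn,  # the caller needs iam:PassRole on this role
        }
    }
```

With the catch-all `.*`, `should_ignore("ValidationException", ".*")` is true, which is how the empty-bucket rejection went unnoticed. With the narrowed pattern, the same code is not ignored and the deploy fails.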

Observability hardening (from review)

agent/src/bedrock_creds_helper.py — every fail-open path now logs to stderr (stdout is the credential channel Claude Code parses). Distinguishes severities: absent file (benign) vs present-but-unreadable (a write bug); expected ClientError/BotoCoreError assume failure vs UNEXPECTED errors; ImportError on boto3 (packaging defect). All still fail open — but a persistent degradation is now visible, not invisible.
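
A minimal sketch of the severity split for the config-file case, assuming a `_warn()` that writes only to stderr. The message wording and the `severity=` markers are invented for the example. The point is that an absent file and a present-but-unreadable file both fail open but leave different traces, and that nothing reaches stdout.

```python
import json
import sys


def _warn(message):
    """Write a diagnostic to stderr; stdout is reserved for the credentials JSON."""
    print(f"[bedrock_creds_helper] {message}", file=sys.stderr)


def read_config(path):
    """Return the parsed config, or None (fail open) with a severity-tagged log."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        # Benign: attribution was not set up for this process.
        _warn(f"severity=info no attribution file at {path}; using ambient credentials")
    except (OSError, ValueError) as exc:
        # The file exists but cannot be read or parsed: likely a write bug.
        _warn(f"severity=error attribution file {path} is unreadable: {exc!r}")
    return None
```

`FileNotFoundError` is caught before the broader `OSError` so that the benign case does not get reported as an error.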

Version alignment

claude-agent-sdk==0.2.110 (pyproject) ↔ npm @anthropic-ai/claude-code@2.1.191 (Dockerfile) pinned in lockstep — the SDK bundles a CLI and both must agree on the control protocol; 2.1.191 also has the awsCredentialExport-with-Expiration refresh behavior the design relies on.

Docs

New docs/design/BEDROCK_COST_ATTRIBUTION.md and docs/guides/COST_ATTRIBUTION.md (operator FinOps guide), cross-linked from COST_MODEL.md and DEPLOYMENT_GUIDE.md. Includes a prominent warning that in-app cost_usd is a client-side SDK estimate, not authoritative billing (mirroring the Claude Agent SDK cost-tracking caveat, adapted for Bedrock → authoritative source is AWS Cost Explorer/CUR), the correct (post-deploy, non-pre-activatable) cost-allocation-tag ordering, and how to verify/re-enable model-invocation logging. Starlight mirrors synced.

What was tested

Automated — full suites green:

CDK: 122 suites / 2211 tests pass. New: agent-session-role.test.ts asserts the Bedrock grant is present (scoped, no Resource:'*') when invokableModels is set and absent when omitted; agent.test.ts regression guards that the logging custom resource never sends largeDataDeliveryS3Config, never uses a catch-all error ignore, and grants iam:PassRole on the logging role.
Agent: 1100 tests pass, 79.7% coverage (gate 72%). test_bedrock_creds_helper.py covers the tagged assume + session name, all fail-open paths (absent/corrupt config, ClientError, unexpected error, no-creds), 0600 file mode, and the stderr diagnostics; test_runner.py covers attribution-file write + header assembly.
Lint (ruff, eslint) clean; cdk synth clean (cdk-nag passes).

Manual review — ran the PR-review toolkit (code-reviewer, silent-failure-hunter) and a security review. No CRITICAL/HIGH findings. The silent-failure review drove the observability hardening above. Security review verified the RCE boundary (root-owned managed-settings, repo can't override), IAM least-privilege (scoped grant, no wildcard), 0600 atomic write, no secret logging, and that json.dumps defeats header injection.

Live verification (deployed dev stack, us-east-1): a real agent task's Bedrock calls show all three metadata fields in the invocation logs, signed by the session-tagged role — proving both tracks end-to-end (Track 2 via requestMetadata, Track 1 via the abca-bedrock-<task_id> session ARN). This also resolves the one risk flagged in the design as unverified: Claude Code does sign the X-Amzn-Bedrock-Request-Metadata header. Redacted sample log record:

{
  "requestMetadata": {
    "user_id": "<redacted-cognito-sub>",
    "repo": "<owner>/<repo>",
    "task_id": "<task-ulid>"
  },
  "modelId": "arn:aws:bedrock:us-east-1:<account>:inference-profile/us.anthropic.claude-sonnet-4-6",
  "identity": {
    "arn": "arn:aws:sts::<account>:assumed-role/<stack>-AgentSessionRole<id>/abca-bedrock-<task-ulid>"
  }
}
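
To show how records of this shape could feed per-user and per-repo reporting, here is a small sketch that counts calls by `(user_id, repo)` and pulls the session name out of the identity ARN. It assumes the logs have been exported as one JSON object per line and uses only the fields shown in the sample record; the function names are hypothetical.

```python
import json
from collections import Counter


def session_name(identity_arn):
    """Return the session name from an assumed-role ARN (the last path segment)."""
    # arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
    return identity_arn.rsplit("/", 1)[-1]


def calls_by_user_and_repo(log_lines):
    """Count invocation-log records per (user_id, repo) pair."""
    counts = Counter()
    for line in log_lines:
        record = json.loads(line)
        meta = record.get("requestMetadata") or {}
        key = (meta.get("user_id", "unattributed"), meta.get("repo", "unattributed"))
        counts[key] += 1
    return counts
```

Records without `requestMetadata` fall into an `unattributed` bucket, which makes calls that bypassed the header visible rather than silently dropped.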

Clean-deploy verification of the logging fixes: I reset the account's model-invocation logging config to empty, then ran cdk deploy of this branch. The ModelInvocationLogging custom resource went UPDATE_COMPLETE (previously UPDATE_FAILED), and the live config came back enabled by the deploy itself (pointing at the stack's own log group + BedrockLoggingRole) — confirming the full chain end-to-end: empty-bucket error removed → masking narrowed → iam:PassRole granted. Stack UPDATE_COMPLETE.

Note: the cost-allocation-tag activation (Cost Explorer / CUR side) remains an operator step documented in the guide and cannot be pre-activated — the tag keys only appear after the first tagged call.

Notes for reviewers

Rebased onto current main after #434 ("feat(cdk): single source of truth for invocable Bedrock models, context-overridable (#433)", the configurable Bedrock models change) merged — the invokable-model loop derives from resolveBedrockModelIds, so the two grant sites stay in lockstep with #434's single source of truth.
cdk/cdk.json is intentionally not included (local-testing context only).

Closes #215

Design for per-user/per-repo Bedrock spend attribution. Key finding: Bedrock is invoked by the Claude Code CLI subprocess, not the agent's boto3, so both tracks (IAM session tags + request metadata) are wired via Claude Code config (awsCredentialExport, ANTHROPIC_CUSTOM_HEADERS) and a new BedrockInvokeRole — not by extending aws_session.py. Refs #215

…ata (#215)

Attribute Bedrock model-inference spend per user/repo. Bedrock is invoked by the Claude Code subprocess (CLAUDE_CODE_USE_BEDROCK=1), so attribution is wired through Claude Code's config, not the agent's boto3.

Track 1 — IAM session-tag chargeback (CUR 2.0 / Cost Explorer):
- Grant bedrock:InvokeModel* on the existing AgentSessionRole (reuse, not a new role) via grantInvoke, mirroring the compute-role grant exactly so cross-region profiles never AccessDenied. Compute role keeps its grant.
- bedrock_creds_helper.py assumes the SessionRole with {user_id,repo,task_id} STS tags and emits creds JSON for Claude Code's awsCredentialExport, which refreshes before the 1h role-chaining cap. Fails OPEN to ambient creds (billing control, not isolation). awsCredentialExport lives in root-owned /etc/claude-code/managed-settings.json so the untrusted repo can't override it (RCE boundary).

Track 2 — per-call forensics (model-invocation logs):
- Set X-Amzn-Bedrock-Request-Metadata via ANTHROPIC_CUSTOM_HEADERS on the subprocess env (one container = one task, so static-per-process is per-task; process-env so the repo can't alter it). SigV4 signed-headers behavior to be validated live (AC#3 documented-blocker path).

Track 3 — operator guide COST_ATTRIBUTION.md + cross-links, plus a prominent warning that in-app cost_usd is a client-side SDK estimate (authoritative source is AWS Cost Explorer / CUR 2.0), mirroring the Claude Agent SDK cost-tracking caveat.

Align claude-agent-sdk 0.2.110 (bundles CLI 2.1.191) with the npm CLI pin.

Tests: CDK Bedrock grant present/absent; helper assume + fail-open paths; runner file+header wiring. #211 tenant-isolation path untouched.

Refs #215

) PR #434 replaces the six named model/profile bindings in agent.ts with a loop over a single source-of-truth id list. Our #215 SessionRole grant referenced those bindings by name, so the merge would break compilation. Adopt #434's loop+collection shape now: build each foundation model + its cross-region profile in a loop, grant the runtime, and collect into one list passed to AgentSessionRole.invokableModels. Behavior is byte-for-byte identical in synth; the eventual #434 merge becomes a one-line swap of the local id array for resolveBedrockModelIds(this.node). Refs #215, #434

…review)

Silent-failure review flagged that bedrock_creds_helper.py degraded silently: a persistent assume-role denial would drop chargeback for weeks with no signal pointing back to this code — the 'invisible degradation' AI004 forbids even when the fallback itself is intended.
- Add _warn() (stderr only — stdout is the credential channel Claude Code parses, so shell.log/fd1 is unusable here).
- Log every fail-open path; distinguish severities: absent file (benign) vs present-but-unreadable (write bug), and expected ClientError/BotoCoreError assume failure vs UNEXPECTED errors.
- Narrow the assume catch to (ClientError, BotoCoreError); catch ImportError on boto3 separately (packaging defect, not AccessDenied). All still fail open.

Behavior unchanged (still fail-open to ambient creds); degradations are now visible and correlatable. Tests cover each distinguished path + its diagnostic.

Refs #215

Security review (LOW/accepted): unlike tenant-data tags, the request-metadata header lives on os.environ because Claude Code reads it from there. Document why that's safe (self-referential non-secret values; json.dumps escaping blocks header injection) in both the code and the design doc, so it reads as intent rather than an oversight against the 'tenant ids out of os.environ' discipline. Refs #215

The IAM-principal tag keys can't be pre-activated — they only appear in the Billing console after the platform makes tagged Bedrock calls. Fix the ordering (deploy → run task → wait ≤24h → activate), point to Billing → Cost allocation tags (not Tag Editor / Resource Groups, which lists resource types), and note the capability may not be enabled in every account/region yet. Refs #215

The ModelInvocationLogging custom resource sent largeDataDeliveryS3Config with an empty bucketName. Bedrock rejects that client-side (ValidationException, 'min length: 3'), and ignoreErrorCodesMatching: '.*' swallowed it while onUpdate never re-fired (static props) — so a fresh deploy silently left model-invocation logging DISABLED, and Bedrock recorded no requestMetadata (#215 Track 2 produced nothing to query). Found during live verification of task 01KWD7S....
- Omit largeDataDeliveryS3Config entirely (optional; only for S3 large-data delivery, which this stack doesn't use). The 'required by API schema' comment was wrong.
- Narrow ignoreErrorCodesMatching from '.*' to transient service errors only (Throttling/ServiceUnavailable/InternalServer) so a client-side misconfiguration fails the deploy loudly instead of disabling logging silently.
- Tests: assert the CR never sends largeDataDeliveryS3Config and never uses a catch-all error ignore.
- Docs: COST_ATTRIBUTION.md now tells operators to verify logging is on in the agent's Region (get-model-invocation-logging-configuration) and how to re-enable it, since metadata is only recorded when logging is active.

Verified live: with logging on, invocation logs show requestMetadata.{user_id, repo, task_id} and the abca-bedrock-<task_id> session ARN — Tracks 1 and 2 both confirmed working end-to-end.

Refs #215

…source (#215) With the empty-bucket validation error fixed, PutModelInvocationLoggingConfiguration now actually reaches Bedrock at deploy — and fails because the custom resource's Lambda role lacks iam:PassRole on BedrockLoggingRole (the role it hands to the Bedrock service to write the log group). This was masked by the earlier client-side ValidationException that ignoreErrorCodesMatching: '.*' swallowed. Add iam:PassRole scoped to the BedrockLoggingRole ARN (not a wildcard). Test asserts the grant is present. Refs #215

bgagent added 7 commits June 30, 2026 16:05

krokoko requested review from a team as code owners June 30, 2026 22:01

isadeks previously approved these changes Jun 30, 2026

krokoko dismissed isadeks’s stale review via 87abbda June 30, 2026 22:16

isadeks approved these changes Jun 30, 2026

krokoko enabled auto-merge June 30, 2026 22:17

krokoko added this pull request to the merge queue Jun 30, 2026

Merged via the queue into main with commit 53a13cb Jun 30, 2026
8 of 9 checks passed

krokoko deleted the feat/215-bedrock-cost-attribution branch June 30, 2026 22:28

krokoko merged 8 commits into main from feat/215-bedrock-cost-attribution

2 participants