feat: Bedrock cost attribution — session tags, request metadata, and operator FinOps guidance (#215) #521
Merged
added 7 commits
June 30, 2026 16:05
Design for per-user/per-repo Bedrock spend attribution. Key finding: Bedrock is invoked by the Claude Code CLI subprocess, not the agent's boto3, so both tracks (IAM session tags + request metadata) are wired via Claude Code config (awsCredentialExport, ANTHROPIC_CUSTOM_HEADERS) and a new BedrockInvokeRole — not by extending aws_session.py. Refs #215
…ata (#215)

Attribute Bedrock model-inference spend per user/repo. Bedrock is invoked by the Claude Code subprocess (CLAUDE_CODE_USE_BEDROCK=1), so attribution is wired through Claude Code's config, not the agent's boto3.

Track 1 — IAM session-tag chargeback (CUR 2.0 / Cost Explorer):
- Grant bedrock:InvokeModel* on the existing AgentSessionRole (reuse, not a new role) via grantInvoke, mirroring the compute-role grant exactly so cross-region profiles never hit AccessDenied. Compute role keeps its grant.
- bedrock_creds_helper.py assumes the SessionRole with {user_id,repo,task_id} STS tags and emits creds JSON for Claude Code's awsCredentialExport, which refreshes before the 1h role-chaining cap. Fails OPEN to ambient creds (billing control, not isolation). awsCredentialExport lives in root-owned /etc/claude-code/managed-settings.json so the untrusted repo can't override it (RCE boundary).

Track 2 — per-call forensics (model-invocation logs):
- Set X-Amzn-Bedrock-Request-Metadata via ANTHROPIC_CUSTOM_HEADERS on the subprocess env (one container = one task, so static-per-process is per-task; process-env so the repo can't alter it). SigV4 signed-headers behavior to be validated live (AC#3 documented-blocker path).

Track 3 — operator guide COST_ATTRIBUTION.md + cross-links, plus a prominent warning that in-app cost_usd is a client-side SDK estimate (authoritative source is AWS Cost Explorer / CUR 2.0), mirroring the Claude Agent SDK cost-tracking caveat.

Align claude-agent-sdk 0.2.110 (bundles CLI 2.1.191) with the npm CLI pin.

Tests: CDK Bedrock grant present/absent; helper assume + fail-open paths; runner file+header wiring. #211 tenant-isolation path untouched.

Refs #215
) PR #434 replaces the six named model/profile bindings in agent.ts with a loop over a single source-of-truth id list. Our #215 SessionRole grant referenced those bindings by name, so the merge would break compilation. Adopt #434's loop+collection shape now: build each foundation model + its cross-region profile in a loop, grant the runtime, and collect into one list passed to AgentSessionRole.invokableModels. Behavior is byte-for-byte identical in synth; the eventual #434 merge becomes a one-line swap of the local id array for resolveBedrockModelIds(this.node). Refs #215, #434
…review)

Silent-failure review flagged that bedrock_creds_helper.py degraded silently: a persistent assume-role denial would drop chargeback for weeks with no signal pointing back to this code. That is the 'invisible degradation' AI004 forbids even when the fallback itself is intended.

- Add _warn() (stderr only: stdout is the credential channel Claude Code parses, so shell.log/fd1 is unusable here).
- Log every fail-open path; distinguish severities: absent file (benign) vs present-but-unreadable (write bug), and expected ClientError/BotoCoreError assume failure vs UNEXPECTED errors.
- Narrow the assume catch to (ClientError, BotoCoreError); catch ImportError on boto3 separately (packaging defect, not AccessDenied). All still fail open.

Behavior unchanged (still fail-open to ambient creds); degradations are now visible and correlatable. Tests cover each distinguished path + its diagnostic.

Refs #215
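The fail-open behaviour with distinct diagnostics can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `bedrock_creds_helper.py`: the function names `load_config` and `export_credentials`, the message wording, and the `expected_errors` parameter are hypothetical. The parameter stands in for `(ClientError, BotoCoreError)` so the sketch runs without boto3 installed, and returning `None` stands in for "fall back to ambient credentials" (how the real helper signals that to Claude Code is not shown in this PR).

```python
import json
import sys
from typing import Callable, Optional, Tuple, Type


def _warn(message: str) -> None:
    """Write a diagnostic to stderr; stdout stays reserved for credential JSON."""
    print(f"bedrock_creds_helper: {message}", file=sys.stderr)


def load_config(path: str) -> Optional[dict]:
    """Read the attribution file, distinguishing 'absent' from 'unreadable'."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        # Benign case: no attribution file was written for this process.
        _warn(f"attribution file absent at {path}; using ambient credentials")
    except (OSError, ValueError) as exc:
        # Present but unreadable or corrupt: points at a bug in the writer.
        _warn(f"attribution file at {path} is unreadable: {exc!r}")
    return None


def export_credentials(
    path: str,
    assume: Callable[[dict], dict],
    expected_errors: Tuple[Type[BaseException], ...],
) -> Optional[str]:
    """Return credential JSON, or None (fail open) after logging the reason.

    `assume` is a hypothetical callable that performs the tagged AssumeRole.
    `expected_errors` would be (ClientError, BotoCoreError) in a boto3 setup.
    """
    config = load_config(path)
    if config is None:
        return None
    try:
        return json.dumps(assume(config))
    except expected_errors as exc:
        # Expected failure class, e.g. an AccessDenied on sts:AssumeRole.
        _warn(f"assume-role failed; using ambient credentials: {exc!r}")
    except Exception as exc:
        # Anything else is a defect worth a louder, greppable marker.
        _warn(f"UNEXPECTED error; using ambient credentials: {exc!r}")
    return None
```

Every path that returns `None` writes exactly one stderr line first, so a persistent degradation leaves a trail even though the task itself keeps running.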
Security review (LOW/accepted): unlike tenant-data tags, the request-metadata header lives on os.environ because Claude Code reads it from there. Document why that's safe (self-referential non-secret values; json.dumps escaping blocks header injection) in both the code and the design doc, so it reads as intent rather than an oversight against the 'tenant ids out of os.environ' discipline. Refs #215
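A minimal sketch of why `json.dumps` escaping blocks header injection, assuming the header value is a JSON object and that `ANTHROPIC_CUSTOM_HEADERS` takes a `Name: Value` string. The function name `build_custom_headers` is hypothetical and is not taken from `runner.py`.

```python
import json

HEADER_NAME = "X-Amzn-Bedrock-Request-Metadata"


def build_custom_headers(user_id: str, repo: str, task_id: str) -> str:
    """Return a 'Name: Value' header line carrying the attribution metadata."""
    metadata = {"user_id": user_id, "repo": repo, "task_id": task_id}
    # json.dumps escapes CR, LF and double quotes inside string values, so a
    # hostile value cannot terminate the header line and start a new header.
    value = json.dumps(metadata, separators=(",", ":"))
    return f"{HEADER_NAME}: {value}"


# A repo name carrying a CRLF injection attempt stays inside the JSON string.
header = build_custom_headers("user-1", "owner/repo\r\nX-Evil: 1", "task-1")
assert "\r" not in header and "\n" not in header
```

The agent would then place the returned string on the subprocess environment (for example `os.environ["ANTHROPIC_CUSTOM_HEADERS"] = header`) before spawning Claude Code.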
The IAM-principal tag keys can't be pre-activated — they only appear in the Billing console after the platform makes tagged Bedrock calls. Fix the ordering (deploy → run task → wait ≤24h → activate), point to Billing → Cost allocation tags (not Tag Editor / Resource Groups, which lists resource types), and note the capability may not be enabled in every account/region yet. Refs #215
The ModelInvocationLogging custom resource sent largeDataDeliveryS3Config with an empty bucketName. Bedrock rejects that client-side (ValidationException, 'min length: 3'), and ignoreErrorCodesMatching: '.*' swallowed it while onUpdate never re-fired (static props). As a result, a fresh deploy silently left model-invocation logging DISABLED, and Bedrock recorded no requestMetadata (#215 Track 2 produced nothing to query). Found during live verification of task 01KWD7S....

- Omit largeDataDeliveryS3Config entirely (optional; only for S3 large-data delivery, which this stack doesn't use). The 'required by API schema' comment was wrong.
- Narrow ignoreErrorCodesMatching from '.*' to transient service errors only (Throttling/ServiceUnavailable/InternalServer) so a client-side misconfiguration fails the deploy loudly instead of disabling logging silently.
- Tests: assert the CR never sends largeDataDeliveryS3Config and never uses a catch-all error ignore.
- Docs: COST_ATTRIBUTION.md now tells operators to verify logging is on in the agent's Region (get-model-invocation-logging-configuration) and how to re-enable it, since metadata is only recorded when logging is active.

Verified live: with logging on, invocation logs show requestMetadata.{user_id, repo, task_id} and the abca-bedrock-<task_id> session ARN. Tracks 1 and 2 are both confirmed working end-to-end.

Refs #215
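As an illustration of how the two tracks can be cross-checked in an invocation log record, here is a small sketch. It assumes each record is a dict with a `requestMetadata` object and an `identity.arn` whose session name is `abca-bedrock-<task_id>`. The function names are hypothetical, and the sketch counts calls only; authoritative cost still comes from Cost Explorer / CUR.

```python
from collections import Counter
from typing import Iterable, Optional

SESSION_PREFIX = "abca-bedrock-"


def task_id_from_session_arn(arn: str) -> Optional[str]:
    """Extract the task id from an assumed-role ARN's session name, if tagged."""
    session_name = arn.rsplit("/", 1)[-1]
    if session_name.startswith(SESSION_PREFIX):
        return session_name[len(SESSION_PREFIX):]
    return None


def tracks_agree(record: dict) -> bool:
    """True when the header metadata and the session ARN name the same task."""
    metadata = record.get("requestMetadata") or {}
    arn = (record.get("identity") or {}).get("arn", "")
    task_id = metadata.get("task_id")
    return task_id is not None and task_id == task_id_from_session_arn(arn)


def calls_per_user(records: Iterable[dict]) -> Counter:
    """Count invocations per user_id; records without metadata are grouped."""
    counts: Counter = Counter()
    for record in records:
        metadata = record.get("requestMetadata") or {}
        counts[metadata.get("user_id", "<untagged>")] += 1
    return counts
```

A record where `tracks_agree` is false suggests one of the two levers silently degraded, for instance the credential helper failing open while the header was still set.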
isadeks
previously approved these changes
Jun 30, 2026
…source (#215) With the empty-bucket validation error fixed, PutModelInvocationLoggingConfiguration now actually reaches Bedrock at deploy — and fails because the custom resource's Lambda role lacks iam:PassRole on BedrockLoggingRole (the role it hands to the Bedrock service to write the log group). This was masked by the earlier client-side ValidationException that ignoreErrorCodesMatching: '.*' swallowed. Add iam:PassRole scoped to the BedrockLoggingRole ARN (not a wildcard). Test asserts the grant is present. Refs #215
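The masking effect can be illustrated with plain regular expressions. The sketch below assumes the ignore pattern is matched against the error code with search semantics, and the narrowed pattern string is an assumption built from the error families named in this PR, not the literal value in `agent.ts`.

```python
import re

# The old catch-all and an assumed narrowed pattern (illustrative only).
CATCH_ALL = re.compile(r".*")
TRANSIENT_ONLY = re.compile(r"Throttling|ServiceUnavailable|InternalServer")


def is_swallowed(pattern: "re.Pattern[str]", error_code: str) -> bool:
    """Return True when the error code matches and would be ignored."""
    return pattern.search(error_code) is not None


for code in ("ValidationException", "AccessDeniedException", "ThrottlingException"):
    print(code, is_swallowed(CATCH_ALL, code), is_swallowed(TRANSIENT_ONLY, code))
```

Under the catch-all, both the empty-bucket `ValidationException` and the missing-`iam:PassRole` `AccessDeniedException` are ignored; under the narrowed pattern only the throttling error is, so the other two fail the deploy.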
isadeks
approved these changes
Jun 30, 2026
Summary
Implements Bedrock cost attribution (#215): attribute Bedrock model-inference spend per user and repo, complementing the in-app `cost_usd` meter and #211 per-session tenant isolation.

The key architectural fact driving the design: Bedrock is invoked by the Claude Code CLI subprocess (`CLAUDE_CODE_USE_BEDROCK=1`), not by the agent's boto3. So attribution cannot be wired through `aws_session.py` (which scopes DynamoDB/S3 tenant data); both levers live in Claude Code's own configuration, set by the agent before it spawns the subprocess.

- Track 1 (IAM session-tag chargeback): `bedrock:InvokeModel*` granted to the existing `AgentSessionRole`; Claude Code's `awsCredentialExport` runs `bedrock_creds_helper.py`, which assumes that role with `{user_id, repo, task_id}` STS session tags. Surfaces in CUR 2.0 / Cost Explorer (`iamPrincipal/` prefix), aggregated.
- Track 2 (per-call forensics): `X-Amzn-Bedrock-Request-Metadata` set via `ANTHROPIC_CUSTOM_HEADERS` on the subprocess env. Surfaces in model-invocation logs (`requestMetadata` field), per call.
- Track 3 (operator guide): `docs/guides/COST_ATTRIBUTION.md` + cross-links.

Tracks 1 and 2 are complementary (per AWS docs): session tags give aggregated billing chargeback (they are not written to invocation logs); request metadata gives per-call detail in logs (it is not a cost-allocation tag). You need both.
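To make the Track 1 flow concrete, here is a minimal sketch of the two pure steps around the STS call: building the session tags and serializing the result for `awsCredentialExport`. The signatures are assumptions for illustration (the real `build_session_tags()` in `aws_session.py` may differ); the `{"Credentials": {...}}` output shape and the `abca-bedrock-<task_id>` session name follow this PR's description.

```python
import json
from datetime import datetime, timezone


def build_session_tags(user_id: str, repo: str, task_id: str) -> list:
    """Return STS session tags in the Key/Value shape AssumeRole expects."""
    return [
        {"Key": "user_id", "Value": user_id},
        {"Key": "repo", "Value": repo},
        {"Key": "task_id", "Value": task_id},
    ]


def format_credentials(assume_role_response: dict) -> str:
    """Serialize an AssumeRole response as {"Credentials": {...}} JSON."""
    creds = assume_role_response["Credentials"]
    expiration = creds["Expiration"]
    if isinstance(expiration, datetime):
        # boto3 returns a timezone-aware datetime; emit it as ISO 8601 in UTC.
        expiration = expiration.astimezone(timezone.utc).isoformat()
    return json.dumps({
        "Credentials": {
            "AccessKeyId": creds["AccessKeyId"],
            "SecretAccessKey": creds["SecretAccessKey"],
            "SessionToken": creds["SessionToken"],
            "Expiration": expiration,
        }
    })


def assume_tagged_role(role_arn: str, user_id: str, repo: str, task_id: str) -> dict:
    """Assume the role with session tags (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure helpers above work without it

    sts = boto3.client("sts")
    return sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"abca-bedrock-{task_id}",
        Tags=build_session_tags(user_id, repo, task_id),
    )
```

In this sketch the helper would print `format_credentials(assume_tagged_role(...))` to stdout, and nothing else, since stdout is the channel Claude Code parses.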
What was implemented
Track 1 — session-tagged credentials
- `cdk/src/constructs/agent-session-role.ts`: new optional `invokableModels` prop. Each invokable is `grantInvoke`-ed to the SessionRole. This is the same grant the compute role receives, reused (not hand-rolled ARNs) so cross-region inference profiles fan out to every routed region and can't hit `AccessDenied`. No `aws:PrincipalTag` condition (the tags are for billing, not access scoping). Scoped to explicit model/profile ARNs, never `Resource:'*'`.
- `cdk/src/stacks/agent.ts`: builds the invokable set in one loop (rebased onto `resolveBedrockModelIds` from #434, "feat(cdk): single source of truth for invocable Bedrock models, context-overridable (#433)"), and grants both the runtime and the SessionRole from the same list so the two grants can't drift.
- `agent/src/bedrock_creds_helper.py` (new): invoked by `awsCredentialExport`. Reads a 0600 file (SessionRole ARN + tags), assumes the role, emits `{"Credentials":{…,"Expiration":…}}`. The real `Expiration` drives Claude Code's pre-expiry refresh, beating the 1 h role-chaining cap on long tasks. Fails open to ambient compute-role credentials (this is a billing/observability control, not tenant isolation; contrast `aws_session.py`'s fail-closed path).
- `agent/managed-settings.json` + Dockerfile: `awsCredentialExport` lives in root-owned `/etc/claude-code/managed-settings.json` (copied before `USER agent`). This is the highest-precedence settings tier, loaded regardless of `setting_sources=["project"]`, so the untrusted cloned repo cannot override it (it runs an arbitrary command, so putting it anywhere the repo can influence would be RCE with the compute role).
- `agent/src/aws_session.py`: extracted `build_session_tags()` so the in-process tenant path and the out-of-process Bedrock helper mint identical tags from one definition.

Track 2 — per-request metadata
- `agent/src/runner.py`: `_setup_bedrock_cost_attribution()` writes the attribution file and sets `ANTHROPIC_CUSTOM_HEADERS` on the process env (a deliberate, documented exception to the "tenant ids out of `os.environ`" rule: the values are self-referential and non-secret, and `json.dumps` escaping blocks header injection).

Track 2 prerequisite fix — model-invocation logging now actually enables on deploy
Found during live verification: the `ModelInvocationLogging` custom resource sent `largeDataDeliveryS3Config` with an empty `bucketName`, which Bedrock rejects client-side (`ValidationException`, `min length: 3`). With `ignoreErrorCodesMatching: '.*'` swallowing it and `onUpdate` never re-firing (static props), a fresh deploy silently left model-invocation logging disabled, so Bedrock recorded no `requestMetadata` and Track 2 produced nothing to query.

- Omit `largeDataDeliveryS3Config` entirely (optional; only for S3 large-data delivery, unused here).
- Narrow `ignoreErrorCodesMatching` from `.*` to transient service errors (Throttling/ServiceUnavailable/InternalServer) so a client-side misconfiguration fails the deploy loudly instead of disabling logging silently.
- Add `iam:PassRole` on `BedrockLoggingRole` to the custom resource's role. `PutModelInvocationLoggingConfiguration` hands that role to the Bedrock service (to write the log group), so the caller needs `PassRole` on it. This was a second latent bug also masked by the `.*` ignore: narrowing the ignore made it surface and fail the deploy (as intended). Fixed here, scoped to the one role ARN (not a wildcard).

Observability hardening (from review)
- `agent/src/bedrock_creds_helper.py`: every fail-open path now logs to stderr (stdout is the credential channel Claude Code parses). Distinguishes severities: absent file (benign) vs present-but-unreadable (a write bug); expected `ClientError`/`BotoCoreError` assume failure vs `UNEXPECTED` errors; `ImportError` on boto3 (packaging defect). All still fail open, but a persistent degradation is now visible, not invisible.

Version alignment
- `claude-agent-sdk==0.2.110` (pyproject) ↔ npm `@anthropic-ai/claude-code@2.1.191` (Dockerfile) pinned in lockstep. The SDK bundles a CLI and both must agree on the control protocol; 2.1.191 also has the `awsCredentialExport`-with-`Expiration` refresh behavior the design relies on.

Docs
- `docs/design/BEDROCK_COST_ATTRIBUTION.md` and `docs/guides/COST_ATTRIBUTION.md` (operator FinOps guide), cross-linked from `COST_MODEL.md` and `DEPLOYMENT_GUIDE.md`. Includes a prominent warning that in-app `cost_usd` is a client-side SDK estimate, not authoritative billing (mirroring the Claude Agent SDK cost-tracking caveat, adapted for Bedrock, where the authoritative source is AWS Cost Explorer/CUR), the correct (post-deploy, non-pre-activatable) cost-allocation-tag ordering, and how to verify/re-enable model-invocation logging. Starlight mirrors synced.

What was tested
Automated — full suites green:
- `agent-session-role.test.ts` asserts the Bedrock grant is present (scoped, no `Resource:'*'`) when `invokableModels` is set and absent when omitted; `agent.test.ts` adds regression guards that the logging custom resource never sends `largeDataDeliveryS3Config`, never uses a catch-all error ignore, and grants `iam:PassRole` on the logging role.
- `test_bedrock_creds_helper.py` covers the tagged assume + session name, all fail-open paths (absent/corrupt config, `ClientError`, unexpected error, no-creds), 0600 file mode, and the stderr diagnostics; `test_runner.py` covers attribution-file write + header assembly.
- `cdk synth` clean (cdk-nag passes).

Manual review: ran the PR-review toolkit (code-reviewer, silent-failure-hunter) and a security review. No CRITICAL/HIGH findings. The silent-failure review drove the observability hardening above. Security review verified the RCE boundary (root-owned managed-settings, repo can't override), IAM least-privilege (scoped grant, no wildcard), 0600 atomic write, no secret logging, and that `json.dumps` defeats header injection.

Live verification (deployed dev stack, us-east-1): a real agent task's Bedrock calls show all three metadata fields in the invocation logs, signed by the session-tagged role, proving both tracks end-to-end (Track 2 via `requestMetadata`, Track 1 via the `abca-bedrock-<task_id>` session ARN). This also resolves the one risk flagged in the design as unverified: Claude Code does sign the `X-Amzn-Bedrock-Request-Metadata` header. Redacted sample log record:

    {
      "requestMetadata": {
        "user_id": "<redacted-cognito-sub>",
        "repo": "<owner>/<repo>",
        "task_id": "<task-ulid>"
      },
      "modelId": "arn:aws:bedrock:us-east-1:<account>:inference-profile/us.anthropic.claude-sonnet-4-6",
      "identity": {
        "arn": "arn:aws:sts::<account>:assumed-role/<stack>-AgentSessionRole<id>/abca-bedrock-<task-ulid>"
      }
    }

Clean-deploy verification of the logging fixes: I reset the account's model-invocation logging config to empty, then ran `cdk deploy` of this branch. The `ModelInvocationLogging` custom resource went `UPDATE_COMPLETE` (previously `UPDATE_FAILED`), and the live config came back enabled by the deploy itself (pointing at the stack's own log group + `BedrockLoggingRole`). This confirms the full chain end-to-end: empty-bucket error removed → masking narrowed → `iam:PassRole` granted. Stack `UPDATE_COMPLETE`.

Notes for reviewers
- Rebased onto `main` after #434, "feat(cdk): single source of truth for invocable Bedrock models, context-overridable (#433)" (configurable Bedrock models), merged. The invokable-model loop derives from `resolveBedrockModelIds`, so the two grant sites stay in lockstep with #434's single source of truth.
- `cdk/cdk.json` is intentionally not included (local-testing context only).

Closes #215