feat(lambda): cross-Lambda installation token cache via DynamoDB #5132
Open
vegardx wants to merge 2 commits into
Add a DynamoDB-backed cache for GitHub App installation access tokens.
Previously every Lambda invocation minted a fresh token via POST
/app/installations/{id}/access_tokens — under burst load this produces
thousands of redundant token-mint calls per minute, triggering rate
limits and secondary rate limit responses from GitHub.
The cache provides:
- Shared token across all concurrent Lambda invocations (scale-up, pool)
- Refresh-ahead at T-10min with conditional-write locking (single-flight)
- Graceful degradation: DDB read failures fall through to direct mint
- Lock TTL backoff: on mint failure the lock expires naturally (60s),
capping retry storms against a struggling upstream
DynamoDB table:
- PAY_PER_REQUEST billing (~$0 at this access pattern)
- TTL-enabled for automatic cleanup
- One table shared across all runner configs (multi-runner)
The table is always created (no feature flag). The env var
INSTALLATION_TOKEN_TABLE_NAME is always set. The cache is transparent:
same token scope, same semantics, just fewer API calls.
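The degradation behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `getTokenWithFallback`, `readThroughCache`, `mintDirect`, and the `Token` shape are hypothetical names, and the only behaviour taken from the description is "use the cache when the table name is set, fall through to a direct mint when the cache fails".

```typescript
// Hypothetical shape of a minted installation token.
type Token = { token: string; expiresAtMs: number };

// Try the shared cache first; on any cache failure fall back to a direct mint.
// All function names here are illustrative assumptions, not the PR's API.
async function getTokenWithFallback(
  tableName: string | undefined,
  readThroughCache: (tableName: string) => Promise<Token>,
  mintDirect: () => Promise<Token>,
  onCacheError: (err: unknown) => void = () => {},
): Promise<Token> {
  // No table configured: behave exactly as before the cache existed.
  if (!tableName) {
    return mintDirect();
  }
  try {
    return await readThroughCache(tableName);
  } catch (err) {
    // Cache trouble must never block token acquisition.
    onCacheError(err);
    return mintDirect();
  }
}
```

The optional `onCacheError` hook is only there so a caller could log the failure; whether the real module logs, counts, or ignores such errors is not stated in the description.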
Refs: github-aws-runners#5037, github-aws-runners#3199, github-aws-runners#4710
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a DynamoDB-backed cache for GitHub App installation tokens to reduce repeated token minting across Lambda invocations, including Terraform resources/IAM wiring and Node dependencies/tests.
Changes:
- Introduces a DynamoDB table for caching installation tokens (with TTL + SSE).
- Wires table name/ARN into runner Lambdas via env vars and adds IAM permissions.
- Adds a control-plane token cache implementation + Vitest coverage and DynamoDB SDK dependency.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| token-cache.tf | Creates a DynamoDB table for installation token caching in the root stack. |
| main.tf | Passes installation token table outputs into the runners module. |
| modules/multi-runner/token-cache.tf | Creates a DynamoDB table for token caching inside the multi-runner module. |
| modules/multi-runner/runners.tf | Passes token table name/ARN into the runners submodule. |
| modules/runners/variables.tf | Adds required inputs for the token cache table name/ARN. |
| modules/runners/scale-up.tf | Exposes table name to the scale-up Lambda and grants DynamoDB access. |
| modules/runners/pool.tf | Propagates token cache table name/ARN into the pool submodule config. |
| modules/runners/pool/main.tf | Exposes table name to the pool Lambda and grants DynamoDB access. |
| lambdas/functions/control-plane/package.json | Adds @aws-sdk/client-dynamodb dependency. |
| lambdas/yarn.lock | Locks new DynamoDB client and transitive AWS SDK dependencies. |
| lambdas/functions/control-plane/src/github/token-cache.ts | Implements DynamoDB-backed cache + locking/single-flight for token minting. |
| lambdas/functions/control-plane/src/github/token-cache.test.ts | Adds unit tests for cache hit/refresh-ahead/cold-miss flows. |
| lambdas/functions/control-plane/src/github/auth.ts | Integrates token cache into installation token auth creation. |
When UpdateItem creates a lock-only record (no token stored yet), it now also sets the ttl attribute so DynamoDB auto-deletes it if the holder crashes and never writes a full token entry.
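A lock-acquiring request of this kind might look like the sketch below. It only builds a plain object in the shape of the low-level DynamoDB `UpdateItem` input, so it runs without the SDK. The attribute names `installation_id`, `lock_until_ms`, and `ttl` come from the PR's schema; the key type (string), the `LOCK_ONLY_TTL_S` grace period, and the `buildLockRequest` helper are assumptions made for illustration.

```typescript
type AttrValue = { S: string } | { N: string };

// Plain-object mirror of the fields of a DynamoDB UpdateItem input used here.
interface UpdateItemInput {
  TableName: string;
  Key: Record<string, AttrValue>;
  UpdateExpression: string;
  ConditionExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, AttrValue>;
}

const LOCK_TTL_MS = 60_000; // lock lifetime of 60s, as stated in the PR
const LOCK_ONLY_TTL_S = 3600; // assumption: how long a lock-only record may linger

// Hypothetical helper: build a conditional write that takes the lock only if
// nobody holds it, or if the previous holder's lock has already expired.
function buildLockRequest(tableName: string, installationId: number, nowMs: number): UpdateItemInput {
  return {
    TableName: tableName,
    // Assumption: the partition key is stored as a string.
    Key: { installation_id: { S: String(installationId) } },
    // "ttl" is a DynamoDB reserved word, so it is aliased as #ttl.
    // if_not_exists leaves the ttl of an existing full token entry untouched.
    UpdateExpression: 'SET lock_until_ms = :until, #ttl = if_not_exists(#ttl, :ttl)',
    ConditionExpression: 'attribute_not_exists(lock_until_ms) OR lock_until_ms < :now',
    ExpressionAttributeNames: { '#ttl': 'ttl' },
    ExpressionAttributeValues: {
      ':until': { N: String(nowMs + LOCK_TTL_MS) },
      // DynamoDB TTL attributes are epoch seconds, not milliseconds.
      ':ttl': { N: String(Math.floor(nowMs / 1000) + LOCK_ONLY_TTL_S) },
      ':now': { N: String(nowMs) },
    },
  };
}
```

With the AWS SDK v3 client, an object like this would be passed to `UpdateItemCommand`; a failed condition surfaces as a `ConditionalCheckFailedException`, which the caller would treat as "someone else holds the lock".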
Problem
Every Lambda invocation (scale-up, pool) mints a fresh GitHub App installation access token via POST /app/installations/{id}/access_tokens. Tokens are valid for 60 minutes, but the module discards them after each invocation; there is no cross-invocation caching.
Under burst load this produces thousands of redundant token-mint calls per minute. Users report hitting rate limits with as few as 10-50 concurrent runners (#3199), with the problem becoming severe at scale (#5037). The token-mint endpoint is subject to both primary rate limits and secondary (abuse) rate limits, which manifest as 403s or delayed 404s.
At 10 runner configs × batch_size 10, a burst of 100 workflow jobs produces ~100 token mints in seconds — all for the same token.
Relevant GitHub API rate limits
- POST /app/installations/{id}/access_tokens
- POST /orgs/{org}/actions/runners/registration-token (actions_runner_registration bucket)
The installation access token endpoint shares the App's 5,000 req/hour JWT-authenticated budget with all other App-level calls (listing installations, getting repo info, etc.). Under burst load, 100+ concurrent token mints can also trigger the secondary rate limit (100 concurrent requests max), resulting in 403s or 502s before the hourly budget is even exhausted.
With the cache: ~1 mint/hour regardless of concurrency. The entire hourly budget is preserved for actual API work.
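The arithmetic behind these figures can be made explicit. The sketch below uses only numbers quoted in this description (10 runner configs, batch_size 10, a 5,000 req/hour budget, 60-minute tokens, refresh at T-10min); the function names are hypothetical, and it assumes one mint per job in the burst, as the example above does.

```typescript
const APP_HOURLY_BUDGET = 5000; // JWT-authenticated requests per hour, per the PR text

// Without a cache, each job in the burst triggers its own mint.
function mintsPerBurstWithoutCache(runnerConfigs: number, batchSize: number): number {
  return runnerConfigs * batchSize;
}

// With refresh-ahead, a new token is minted once per (lifetime - refreshAhead) minutes.
function mintsPerHourWithCache(tokenLifetimeMin = 60, refreshAheadMin = 10): number {
  return 60 / (tokenLifetimeMin - refreshAheadMin);
}

const burst = mintsPerBurstWithoutCache(10, 10); // 100 mints for a single burst
const burstShareOfBudget = burst / APP_HOURLY_BUDGET; // 0.02, i.e. 2% of the hourly budget
const cachedRate = mintsPerHourWithCache(); // 1.2 mints per hour
```

A single burst therefore consumes about 2% of the hourly budget on token mints alone, while the cached rate stays at roughly one mint per hour regardless of how many bursts occur.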
Solution
A DynamoDB table that caches installation tokens across all Lambda invocations. One token mint per ~50 minutes (refresh-ahead), shared by all concurrent Lambdas.
Why this should be default-on (no feature flag)
repositoryIds narrowing)
How it works
sequenceDiagram
    participant A as Lambda A (scale-up)
    participant DDB as DynamoDB
    participant GH as GitHub API
    Note over A,GH: Case A: Fresh cache hit
    A->>DDB: GetItem(installation_id)
    DDB-->>A: token (expires in 30min)
    Note right of A: Return cached token
    Note over A,GH: Case B: Refresh-ahead (token expiring soon)
    participant B as Lambda B (scale-up)
    participant C as Lambda C (concurrent)
    B->>DDB: GetItem(installation_id)
    DDB-->>B: token (expires in 5min)
    B->>DDB: UpdateItem (acquire lock)
    DDB-->>B: lock acquired ✓
    B->>GH: POST /access_tokens
    GH-->>B: new token + expiresAt
    B->>DDB: PutItem (store token, clear lock)
    C->>DDB: GetItem(installation_id)
    DDB-->>C: token (still valid, 5min left)
    Note right of C: Return cached token (no mint needed)
    Note over A,GH: Case C: Cold miss
    A->>DDB: GetItem(installation_id)
    DDB-->>A: ∅ (no item)
    A->>DDB: UpdateItem (acquire lock)
    DDB-->>A: lock acquired ✓
    A->>GH: POST /access_tokens
    GH-->>A: token + expiresAt
    A->>DDB: PutItem (store token)
Three cases:
- Fresh cache hit: the cached token is not close to expiry and is returned as-is.
- Refresh-ahead: the token is close to expiry; one Lambda acquires the lock and mints a new token while concurrent Lambdas keep using the still-valid cached token.
- Cold miss: no item exists; the Lambda acquires the lock, mints a token, and stores it.
On mint failure the lock expires naturally after 60s — caps retry storms.
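The case split shown in the diagram can be expressed as a small pure function. This is a sketch under assumptions: `classify`, `CachedToken`, and the decision labels are hypothetical names, the 10-minute threshold is the T-10min refresh window from this PR, and treating an already-expired token like a cold miss is an assumption rather than something the description states.

```typescript
// Hypothetical view of what GetItem returned; undefined means no item.
type CachedToken = { token?: string; expiresAtMs?: number } | undefined;
type Decision = 'hit' | 'refresh-ahead' | 'cold-miss';

const REFRESH_AHEAD_MS = 10 * 60 * 1000; // refresh window of T-10min, per the PR

function classify(item: CachedToken, nowMs: number): Decision {
  // No item, a lock-only record without a token, or an expired token:
  // nothing usable is cached (assumption: all three are handled alike).
  if (!item || !item.token || item.expiresAtMs === undefined || item.expiresAtMs <= nowMs) {
    return 'cold-miss';
  }
  // Token still valid but inside the refresh window: try to take the lock.
  if (item.expiresAtMs - nowMs <= REFRESH_AHEAD_MS) {
    return 'refresh-ahead';
  }
  // Plenty of lifetime left: return the cached token.
  return 'hit';
}
```

In the refresh-ahead case, a Lambda that loses the lock race can still return the cached token, which is what Lambda C does in the diagram.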
Changes
Lambda (TypeScript)
- lambdas/functions/control-plane/src/github/token-cache.ts: cache module with locking
- lambdas/functions/control-plane/src/github/token-cache.test.ts: 8 tests covering all paths
- lambdas/functions/control-plane/src/github/auth.ts: integration, routes through the cache when INSTALLATION_TOKEN_TABLE_NAME is set
- lambdas/functions/control-plane/package.json: add @aws-sdk/client-dynamodb
Terraform
- token-cache.tf (root module): DynamoDB table for single-runner deployments
- modules/multi-runner/token-cache.tf: shared table for multi-runner deployments
- modules/runners/variables.tf: new installation_token_table_name / _arn variables
- modules/runners/scale-up.tf: env var + IAM policy for scale-up Lambda
- modules/runners/pool.tf + modules/runners/pool/main.tf: same for pool Lambda
DynamoDB schema
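For orientation, the schema attributes in this section might map to a TypeScript shape like the one below. The attribute names are the PR's; the types, the optionality of `lock_until_ms`, the choice to set `ttl` to the token's expiry, and the `toCacheItem` helper are assumptions. The helper relies on GitHub returning `expires_at` as an ISO 8601 timestamp in the access token response.

```typescript
// Assumed item shape; only the attribute names are taken from the PR.
interface TokenCacheItem {
  installation_id: string; // partition key (assumed to be a string)
  token: string;
  expires_at_ms: number; // token expiry, epoch milliseconds
  lock_until_ms?: number; // present only while a refresh is in flight
  ttl: number; // DynamoDB TTL, epoch seconds
}

// Hypothetical helper: convert a freshly minted token into a cache item.
function toCacheItem(installationId: number, token: string, expiresAtIso: string): TokenCacheItem {
  const expiresAtMs = Date.parse(expiresAtIso);
  if (Number.isNaN(expiresAtMs)) {
    throw new Error(`unparseable expiry: ${expiresAtIso}`);
  }
  return {
    installation_id: String(installationId),
    token,
    expires_at_ms: expiresAtMs,
    // Assumption: let DynamoDB delete the item once the token itself has expired.
    ttl: Math.floor(expiresAtMs / 1000),
  };
}
```

Omitting `lock_until_ms` from the stored item corresponds to the "store token, clear lock" step in the sequence diagram.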
- installation_id
- token
- expires_at_ms
- lock_until_ms
- ttl
Impact
Refs: #5037, #3199, #4710