feat(lambda): cross-Lambda installation token cache via DynamoDB#5132

Open
vegardx wants to merge 2 commits into
github-aws-runners:mainfrom
vegardx:feat/installation-token-cache-dynamodb
@vegardx vegardx commented May 26, 2026

Problem

Every Lambda invocation (scale-up, pool) mints a fresh GitHub App installation access token via POST /app/installations/{id}/access_tokens. Tokens are valid for 60 minutes, but the module discards them after each invocation — there is no cross-invocation caching.

Under burst load this produces thousands of redundant token-mint calls per minute. Users report hitting rate limits with as few as 10-50 concurrent runners (#3199), and the problem becomes severe at scale (#5037). The token-mint endpoint is subject to both primary rate limits and secondary (abuse) rate limits, which manifest as 403s or delayed 404s.

At 10 runner configs × batch_size 10, a burst of 100 workflow jobs produces ~100 token mints in seconds — all for the same token.

Relevant GitHub API rate limits

| Endpoint | Primary limit | Secondary limit | Notes |
|---|---|---|---|
| POST /app/installations/{id}/access_tokens | 5,000 req/hour (shared JWT budget) | 900 points/min, 100 concurrent | This is what the cache targets |
| POST /orgs/{org}/actions/runners/registration-token | 10,000 req/hour (actions_runner_registration bucket) | 900 points/min, 100 concurrent | JIT config calls; unaffected by this PR |

The installation access token endpoint shares the App's 5,000 req/hour JWT-authenticated budget with all other App-level calls (listing installations, getting repo info, etc.). Under burst load, 100+ concurrent token mints can also trigger the secondary rate limit (100 concurrent requests max), resulting in 403s or 502s before the hourly budget is even exhausted.

With the cache: ~1 mint/hour regardless of concurrency. The entire hourly budget is preserved for actual API work.

Solution

A DynamoDB table caches installation tokens across all Lambda invocations: one token mint per ~50 minutes (refresh-ahead), shared by all concurrent Lambdas.

Why this should be default-on (no feature flag)

  1. Zero risk of behavioral change — the cached token has identical scope to a freshly-minted one (full installation scope, no repositoryIds narrowing)
  2. Graceful degradation — if DDB is unreachable, the code falls through to direct mint (same as today)
  3. Effectively free — PAY_PER_REQUEST DynamoDB at ~1 write/hour + a few reads/minute costs < $0.01/month
  4. The alternative (multiple GitHub Apps, #5037) is operationally complex: it requires splitting orgs, managing multiple app installations, and routing logic
  5. Every deployment benefits — even small deployments avoid unnecessary API calls; large deployments avoid rate limit failures
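The graceful-degradation point (item 2) can be illustrated with a minimal sketch. It is not the PR's actual `token-cache.ts`: the names `CachedToken`, `CacheReader`, `TokenMinter`, and `getInstallationToken` are hypothetical, and locking is deliberately left out so that only the fall-through path is shown.

```typescript
// Hypothetical types and names for illustration; the PR's module may differ.
interface CachedToken {
  token: string;
  expiresAtMs: number;
}

type CacheReader = (installationId: number) => Promise<CachedToken | undefined>;
type TokenMinter = (installationId: number) => Promise<CachedToken>;

async function getInstallationToken(
  installationId: number,
  readCache: CacheReader,
  mint: TokenMinter,
  nowMs: number = Date.now(),
): Promise<CachedToken> {
  try {
    const cached = await readCache(installationId);
    // Only return a cached token that has not expired yet.
    if (cached && cached.expiresAtMs > nowMs) {
      return cached;
    }
  } catch (err) {
    // Cache unreachable: log and fall through to a direct mint, as before this PR.
    console.warn('token cache read failed, minting directly', err);
  }
  return mint(installationId);
}
```

Because a cache failure is caught and ignored, the worst case in this sketch is one mint per invocation, which matches the pre-cache behavior.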

How it works

sequenceDiagram
    participant A as Lambda A (scale-up)
    participant DDB as DynamoDB
    participant GH as GitHub API

    Note over A,GH: Case A: Fresh cache hit
    A->>DDB: GetItem(installation_id)
    DDB-->>A: token (expires in 30min)
    Note right of A: Return cached token

    Note over A,GH: Case B: Refresh-ahead (token expiring soon)
    participant B as Lambda B (scale-up)
    participant C as Lambda C (concurrent)

    B->>DDB: GetItem(installation_id)
    DDB-->>B: token (expires in 5min)
    B->>DDB: UpdateItem (acquire lock)
    DDB-->>B: lock acquired ✓
    B->>GH: POST /access_tokens
    GH-->>B: new token + expiresAt
    B->>DDB: PutItem (store token, clear lock)

    C->>DDB: GetItem(installation_id)
    DDB-->>C: token (still valid, 5min left)
    Note right of C: Return cached token (no mint needed)

    Note over A,GH: Case C: Cold miss
    A->>DDB: GetItem(installation_id)
    DDB-->>A: ∅ (no item)
    A->>DDB: UpdateItem (acquire lock)
    DDB-->>A: lock acquired ✓
    A->>GH: POST /access_tokens
    GH-->>A: token + expiresAt
    A->>DDB: PutItem (store token)

Three cases:

  • A. Fresh hit (>10min to expiry): return cached, no GitHub call
  • B. Refresh-ahead (<10min to expiry): one Lambda wins lock, mints, others return still-valid cached token
  • C. Cold miss: one Lambda wins lock, mints; others wait briefly then read from cache

On mint failure the lock is not released explicitly; it expires on its own after 60s, which caps retry storms.
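The three cases reduce to a comparison between the cached expiry and the current time. The sketch below assumes a 10-minute refresh-ahead window as described above; the names `CacheDecision`, `REFRESH_AHEAD_MS`, and `classify` are hypothetical, and treating an already-expired token as a cold miss is an assumption rather than something the PR states.

```typescript
type CacheDecision = 'fresh-hit' | 'refresh-ahead' | 'cold-miss';

// T-10min refresh-ahead threshold, taken from the PR description.
const REFRESH_AHEAD_MS = 10 * 60 * 1000;

// expiresAtMs is undefined when no item (or no token) exists in the table.
function classify(expiresAtMs: number | undefined, nowMs: number): CacheDecision {
  // Assumption: an expired token is handled like a missing one.
  if (expiresAtMs === undefined || expiresAtMs <= nowMs) {
    return 'cold-miss';
  }
  // Exactly 10 minutes remaining is treated as refresh-ahead here (unspecified in the PR).
  return expiresAtMs - nowMs > REFRESH_AHEAD_MS ? 'fresh-hit' : 'refresh-ahead';
}
```

In the refresh-ahead case the caller can still return the cached token if it loses the lock, since the token remains valid for up to 10 more minutes.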

Changes

Lambda (TypeScript)

  • lambdas/functions/control-plane/src/github/token-cache.ts — cache module with locking
  • lambdas/functions/control-plane/src/github/token-cache.test.ts — 8 tests covering all paths
  • lambdas/functions/control-plane/src/github/auth.ts — integration: route through cache when INSTALLATION_TOKEN_TABLE_NAME is set
  • lambdas/functions/control-plane/package.json — add @aws-sdk/client-dynamodb

Terraform

  • token-cache.tf (root module) — DynamoDB table for single-runner deployments
  • modules/multi-runner/token-cache.tf — shared table for multi-runner deployments
  • modules/runners/variables.tf — new installation_token_table_name / _arn variables
  • modules/runners/scale-up.tf — env var + IAM policy for scale-up Lambda
  • modules/runners/pool.tf + modules/runners/pool/main.tf — same for pool Lambda

DynamoDB schema

| Attribute | Type | Purpose |
|---|---|---|
| installation_id | N (hash key) | GitHub App installation ID |
| token | S | Cached access token |
| expires_at_ms | N | Token expiry (epoch ms) |
| lock_until_ms | N | Mint-in-progress lock expiry |
| ttl | N | DynamoDB TTL (epoch seconds) |
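As a rough illustration of how a token entry could map onto this schema, the sketch below builds the low-level attribute-value map for a `PutItem` call. The names `TokenCacheItem` and `toDynamoItem` are hypothetical, and setting `ttl` to the token's expiry is an assumption; the PR does not state which value it writes. `lock_until_ms` is omitted because `PutItem` replaces the whole item, which clears any existing lock.

```typescript
// Hypothetical in-memory shape of a cache entry.
interface TokenCacheItem {
  installationId: number;
  token: string;
  expiresAtMs: number;
}

// DynamoDB's low-level API sends numbers as strings under "N" and strings under "S".
type AttributeValue = { N: string } | { S: string };

function toDynamoItem(item: TokenCacheItem): Record<string, AttributeValue> {
  return {
    installation_id: { N: String(item.installationId) },
    token: { S: item.token },
    expires_at_ms: { N: String(item.expiresAtMs) },
    // DynamoDB TTL expects epoch seconds, so convert from milliseconds.
    // Assumption: the item may be deleted once the token itself has expired.
    ttl: { N: String(Math.ceil(item.expiresAtMs / 1000)) },
  };
}
```

Note the unit mismatch in the schema: `expires_at_ms` and `lock_until_ms` are in milliseconds, while `ttl` must be in seconds for DynamoDB's TTL feature to interpret it correctly.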

Impact

| Metric | Before | After |
|---|---|---|
| Token mints per hour | N (= total Lambda invocations) | ~1 per installation |
| GitHub API calls during burst | 100s-1000s of redundant mints | 1 mint + reads from DDB |
| Cost of cache infrastructure | N/A | ~$0/month (PAY_PER_REQUEST) |
| Failure mode if DDB is down | N/A | Falls through to direct mint |

Refs: #5037, #3199, #4710

Add a DynamoDB-backed cache for GitHub App installation access tokens.
Previously every Lambda invocation minted a fresh token via POST
/app/installations/{id}/access_tokens — under burst load this produces
thousands of redundant token-mint calls per minute, triggering rate
limits and secondary rate limit responses from GitHub.

The cache provides:
- Shared token across all concurrent Lambda invocations (scale-up, pool)
- Refresh-ahead at T-10min with conditional-write locking (single-flight)
- Graceful degradation: DDB read failures fall through to direct mint
- Lock TTL backoff: on mint failure the lock expires naturally (60s),
  capping retry storms against a struggling upstream

DynamoDB table:
- PAY_PER_REQUEST billing (~$0 at this access pattern)
- TTL-enabled for automatic cleanup
- One table shared across all runner configs (multi-runner)

The table is always created (no feature flag). The env var
INSTALLATION_TOKEN_TABLE_NAME is always set. The cache is transparent:
same token scope, same semantics, just fewer API calls.

Refs: github-aws-runners#5037, github-aws-runners#3199, github-aws-runners#4710
Copilot AI review requested due to automatic review settings May 26, 2026 20:43
@vegardx vegardx requested review from a team as code owners May 26, 2026 20:43
Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a DynamoDB-backed cache for GitHub App installation tokens to reduce repeated token minting across Lambda invocations, including Terraform resources/IAM wiring and Node dependencies/tests.

Changes:

  • Introduces a DynamoDB table for caching installation tokens (with TTL + SSE).
  • Wires table name/ARN into runner Lambdas via env vars and adds IAM permissions.
  • Adds a control-plane token cache implementation + Vitest coverage and DynamoDB SDK dependency.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| token-cache.tf | Creates a DynamoDB table for installation token caching in the root stack. |
| main.tf | Passes installation token table outputs into the runners module. |
| modules/multi-runner/token-cache.tf | Creates a DynamoDB table for token caching inside the multi-runner module. |
| modules/multi-runner/runners.tf | Passes token table name/ARN into the runners submodule. |
| modules/runners/variables.tf | Adds required inputs for the token cache table name/ARN. |
| modules/runners/scale-up.tf | Exposes table name to the scale-up Lambda and grants DynamoDB access. |
| modules/runners/pool.tf | Propagates token cache table name/ARN into the pool submodule config. |
| modules/runners/pool/main.tf | Exposes table name to the pool Lambda and grants DynamoDB access. |
| lambdas/functions/control-plane/package.json | Adds @aws-sdk/client-dynamodb dependency. |
| lambdas/yarn.lock | Locks new DynamoDB client and transitive AWS SDK dependencies. |
| lambdas/functions/control-plane/src/github/token-cache.ts | Implements DynamoDB-backed cache + locking/single-flight for token minting. |
| lambdas/functions/control-plane/src/github/token-cache.test.ts | Adds unit tests for cache hit/refresh-ahead/cold-miss flows. |
| lambdas/functions/control-plane/src/github/auth.ts | Integrates token cache into installation token auth creation. |

Comment thread modules/runners/variables.tf
Comment thread modules/runners/scale-up.tf
Comment thread modules/runners/pool/main.tf
Comment thread lambdas/functions/control-plane/src/github/token-cache.ts
Comment thread lambdas/functions/control-plane/src/github/token-cache.test.ts
Comment thread token-cache.tf
Comment thread lambdas/functions/control-plane/src/github/token-cache.ts
When UpdateItem creates a lock-only record (no token stored yet), it now
also sets the ttl attribute so DynamoDB auto-deletes it if the holder
crashes and never writes a full token entry.
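
One way such a lock acquisition could look is sketched below as a plain object in the shape of an `UpdateItem` request. This is an assumption about the approach, not the PR's code: `buildAcquireLockInput` and `LOCK_MS` are hypothetical names, and the choice of `if_not_exists` to avoid overwriting the `ttl` of an existing token entry is illustrative.

```typescript
// Lock lifetime of 60s, taken from the PR description.
const LOCK_MS = 60000;

// Builds an UpdateItem-shaped request that takes the mint lock only if it is free or expired.
function buildAcquireLockInput(tableName: string, installationId: number, nowMs: number) {
  const lockUntilMs = nowMs + LOCK_MS;
  return {
    TableName: tableName,
    Key: { installation_id: { N: String(installationId) } },
    // if_not_exists keeps the ttl of an existing token entry and only sets one
    // on a lock-only record, so a crashed holder's record is cleaned up eventually.
    UpdateExpression: 'SET #lock = :lockUntil, #ttl = if_not_exists(#ttl, :ttl)',
    // Conditional write: fails if another invocation holds an unexpired lock.
    ConditionExpression: 'attribute_not_exists(#lock) OR #lock < :now',
    // Placeholders avoid clashes with DynamoDB reserved words.
    ExpressionAttributeNames: { '#lock': 'lock_until_ms', '#ttl': 'ttl' },
    ExpressionAttributeValues: {
      ':lockUntil': { N: String(lockUntilMs) },
      ':now': { N: String(nowMs) },
      // TTL is in epoch seconds; assumption: it matches the lock expiry.
      ':ttl': { N: String(Math.ceil(lockUntilMs / 1000)) },
    },
  };
}
```

A caller would pass this object to the DynamoDB client's `UpdateItem` operation and treat a conditional-check failure as "another Lambda is already minting".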