Problem
Our AWS, Azure, and Docker integration tests all call real external services. This causes:
- Flaky CI — transient rate limits (e.g. AWS Lambda 429s), network issues, and external outages cause test failures unrelated to code changes
- Slow feedback — waiting on real API calls adds significant time to test runs
- Credential dependency — tests can't run without real credentials and infrastructure (Lambda functions, ECS clusters, S3 buckets, Azure apps)
- Poor edge case coverage — hard to test error paths like "what happens when Lambda returns a 500 on the 3rd page of results?"
Proposed approach
Introduce a two-layer testing strategy:
Layer 1: Mock/unit tests (fast, run on every PR)
- Introduce interfaces for cloud service clients (e.g. a LambdaAPI interface with ListFunctions and GetFunctionConfiguration)
- Write mock implementations for use in tests
- Cover logic, edge cases, and error paths: pagination, concurrent goroutine behavior, filtering, partial failures
Layer 2: Integration tests (slow, run on main/daily)
- Keep a small set of smoke tests per cloud provider that prove real wiring works (credentials, API compatibility, serialization)
- These can run less frequently: on pushes to main or via daily-cli-tests.yml
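One possible way to keep the smoke tests out of PR runs is an opt-in environment variable that only the main/daily job sets. This is a sketch under assumptions: the variable name RUN_CLOUD_SMOKE_TESTS, the smokeEnabled helper, and the test name are all hypothetical, and the code would live in a _test.go file. A Go build tag would be an equally reasonable alternative.

```go
package main

import (
	"os"
	"testing"
)

// smokeEnvVar is a hypothetical opt-in switch; the CI job that runs the
// smoke tests would need to set it to "1".
const smokeEnvVar = "RUN_CLOUD_SMOKE_TESTS"

// smokeEnabled reports whether real-cloud smoke tests should run.
// The lookup function is injected so the gate itself is unit-testable.
func smokeEnabled(getenv func(string) string) bool {
	return getenv(smokeEnvVar) == "1"
}

// TestLambdaSmoke is skipped unless the opt-in variable is set, so PR
// runs never need real credentials.
func TestLambdaSmoke(t *testing.T) {
	if !smokeEnabled(os.Getenv) {
		t.Skip("set " + smokeEnvVar + "=1 to run smoke tests against real AWS")
	}
	// Real client construction and a single ListFunctions call would go here.
}
```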
Packages to target (in priority order)
internal/aws — Lambda, ECS, S3 (highest flake rate due to rate limiting)
internal/azure — Azure Apps
internal/docker — Docker/OCI
internal/kube — Kubernetes
Context
We recently switched AWS to adaptive retry mode (PR #757), which mitigates rate limiting, but the underlying problem remains: we have no fast, isolated test layer for cloud integrations.