feat: add monitors #64

Open
maxday wants to merge 15 commits into main from maxday/add-monitors

maxday (Member) commented Apr 22, 2026

Add CloudWatch Monitoring for Integration Tests

This PR implements automated CloudWatch monitoring for the Ruby Runtime Interface Client integration tests to ensure continuous test health across all supported configurations.

What's Changed

  1. Centralized Test Matrix Configuration
    • Added .github/test-matrix.json to define all test permutations in a single source of truth
    • Covers 16 combinations: 2 architectures (x64, arm64) × 4 distros (AL2023, Alpine, Debian, Ubuntu) × 2 Ruby versions (3.3, 3.4)
    • Refactored workflows to share the same matrix logic, eliminating duplication
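
As a rough illustration of how the 16 permutations fall out of a single matrix file, here is a minimal sketch. The JSON shape and key names (`arch`, `distro`, `ruby_version`) are assumptions made for this example; the actual layout of .github/test-matrix.json is not shown in this description and may differ.

```python
import itertools
import json

# Hypothetical shape for .github/test-matrix.json; the real file may differ.
MATRIX_JSON = """
{
  "arch": ["x64", "arm64"],
  "distro": ["al2023", "alpine", "debian", "ubuntu"],
  "ruby_version": ["3.3", "3.4"]
}
"""


def expand_matrix(matrix: dict) -> list:
    """Return one dict per permutation (Cartesian product of all axes)."""
    keys = list(matrix)
    axes = [matrix[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*axes)]


permutations = expand_matrix(json.loads(MATRIX_JSON))
print(len(permutations))  # 16
```

Because every workflow would read the same file, adding a distro or Ruby version in one place changes both the test jobs and the alarms derived from them.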

  2. CloudWatch Alarm Infrastructure
    • New workflow: .github/workflows/bootstrap-alarms.yml
    • Bootstraps the Runtime AWS account with individual CloudWatch alarms for each test permutation (dynamically computed from the JSON matrix)
    • Creates a composite aggregate alarm that triggers if ANY individual alarm is in the ALARM state or has insufficient data
    • Each individual alarm triggers if no successful test metric is received within 3 days (uses 1-day evaluation periods for faster state transitions)
    • Idempotent operations: re-running won't destroy existing alarms
    • Runs on pull requests and can be manually triggered
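
The alarm setup described above could be sketched as follows. This is not the workflow's actual code: the namespace, metric name, alarm naming scheme, threshold, and `TreatMissingData` setting are all assumptions for illustration. The dictionaries use the parameter names accepted by CloudWatch's PutMetricAlarm API, and the rule string uses the composite-alarm functions `ALARM(...)` and `INSUFFICIENT_DATA(...)`.

```python
# Hypothetical names; the PR description does not state the real ones.
NAMESPACE = "RubyRIC/IntegrationTests"
METRIC_NAME = "TestSuccess"
ONE_DAY_SECONDS = 24 * 60 * 60


def alarm_name(perm: dict) -> str:
    """Build a deterministic alarm name so re-runs target the same alarm."""
    return "ric-{distro}-{distro_version}-ruby{runtime_version}-{arch}".format(**perm)


def metric_alarm_params(perm: dict) -> dict:
    """Parameters for one per-permutation alarm: 3 one-day periods without a success."""
    return {
        "AlarmName": alarm_name(perm),
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Dimensions": [
            {"Name": "Distro", "Value": perm["distro"]},
            {"Name": "DistroVersion", "Value": perm["distro_version"]},
            {"Name": "RuntimeVersion", "Value": perm["runtime_version"]},
            {"Name": "Arch", "Value": perm["arch"]},
        ],
        "Statistic": "Sum",
        "Period": ONE_DAY_SECONDS,        # 1-day evaluation period
        "EvaluationPeriods": 3,           # 3 days in total
        "DatapointsToAlarm": 3,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # assumption: missing data counts as failure
    }


def composite_rule(names: list) -> str:
    """Rule that fires if any child alarm is in ALARM or INSUFFICIENT_DATA."""
    return " OR ".join(f'ALARM("{n}") OR INSUFFICIENT_DATA("{n}")' for n in names)


perms = [
    {"distro": "alpine", "distro_version": "3.21", "runtime_version": "3.4", "arch": "x64"},
    {"distro": "alpine", "distro_version": "3.21", "runtime_version": "3.4", "arch": "arm64"},
]
rule = composite_rule([alarm_name(p) for p in perms])
```

PutMetricAlarm creates the alarm if the name is new and updates it otherwise, which is one way the re-run behavior described above can be achieved. The distro version `3.21` in the sample data is a placeholder.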

  3. Enhanced Integration Tests
    • Refactored .github/workflows/integration-tests.yml to use the shared test matrix
    • Added scheduled runs every workday (Mon-Fri at 08:00 UTC) to match our pipeline freshness policy
    • Verifies on every scheduled run that the RIC still works with newer versions of the base OS images
    • Integrated AWS OIDC authentication for CloudWatch access
    • Publishes success metrics to CloudWatch after each successful test run
    • Metrics include dimensions: Distro, DistroVersion, RuntimeVersion, and Arch
    • Missing data also triggers the alarms, so a test that silently stops running does not go unnoticed
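
To make the metric shape concrete, here is a minimal sketch of the datum a successful run might publish. The metric name, the value of 1, and the sample dimension values are assumptions; only the four dimension names come from this description. The dictionary follows the `MetricData` entry format of CloudWatch's PutMetricData API. For the schedule, a cron expression such as `0 8 * * 1-5` would express Mon-Fri at 08:00 UTC, though the workflow's exact expression is not shown here.

```python
from datetime import datetime, timezone
from typing import Optional


def success_datum(distro: str, distro_version: str, runtime_version: str,
                  arch: str, now: Optional[datetime] = None) -> dict:
    """Build one MetricData entry recording a successful test run."""
    return {
        "MetricName": "TestSuccess",  # hypothetical metric name
        "Dimensions": [
            {"Name": "Distro", "Value": distro},
            {"Name": "DistroVersion", "Value": distro_version},
            {"Name": "RuntimeVersion", "Value": runtime_version},
            {"Name": "Arch", "Value": arch},
        ],
        "Timestamp": now or datetime.now(timezone.utc),
        "Value": 1.0,                 # one success per run
        "Unit": "Count",
    }


# Sample values are placeholders, not taken from the real matrix.
datum = success_datum("debian", "12", "3.3", "arm64",
                      now=datetime(2026, 4, 22, 8, 0, tzinfo=timezone.utc))
```

Publishing only on success means a failing job and a job that never ran look the same to CloudWatch: no datapoint. That matches the no-data behavior described above.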

  4. Configuration
    • Added required secrets: AWS_ALARM_TARGET_ARN, AWS_OIDC_ROLE_ARN, and AWS_REGION
    • Current alarm action: SNS → email notifications to the team
    • Future enhancement: automatic SIM ticket creation
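
A workflow step could fail fast when one of the required secrets is missing. The helper below is a hypothetical sketch, assuming the secrets are exposed to the step as environment variables with the same names; it is not part of this PR.

```python
import os

REQUIRED_SECRETS = ("AWS_ALARM_TARGET_ARN", "AWS_OIDC_ROLE_ARN", "AWS_REGION")


def missing_secrets(env: dict) -> list:
    """Return the required names that are absent or empty in env."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets(dict(os.environ))
    if missing:
        raise SystemExit("Missing required secrets: " + ", ".join(missing))
```

Empty values are treated as missing because an unset GitHub secret expands to an empty string rather than raising an error.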

Benefits

• Proactive monitoring: Get alerted when tests stop running or start failing consistently
• Centralized visibility: Single aggregate alarm for the entire test suite
• Daily validation: Continuous verification that RIC works with latest base OS versions
• Reduced duplication: Test matrix defined once and shared across workflows
• No silent failures: Missing metrics trigger alarms

[Screenshot, 2026-04-22 17:56:22]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

maxday marked this pull request as ready for review on April 22, 2026, 18:10