From 3b236e6343aed729369f94226ddc9524248ce4e9 Mon Sep 17 00:00:00 2001 From: Paul Santus Date: Wed, 24 Sep 2025 15:49:51 +0200 Subject: [PATCH] Add XRay to Cloudwatch Logs Insights tool --- .../README.md | 20 ++ .../aws-xray-to-cloudwatch-logs-insights.md | 197 ++++++++++++++ .../aws-xray-to-cloudwatch-logs-insights.sh | 248 ++++++++++++++++++ 3 files changed, 465 insertions(+) create mode 100644 tuts/087-aws-xray-to-cloudwatch-logs-insights/README.md create mode 100644 tuts/087-aws-xray-to-cloudwatch-logs-insights/aws-xray-to-cloudwatch-logs-insights.md create mode 100755 tuts/087-aws-xray-to-cloudwatch-logs-insights/aws-xray-to-cloudwatch-logs-insights.sh diff --git a/tuts/087-aws-xray-to-cloudwatch-logs-insights/README.md b/tuts/087-aws-xray-to-cloudwatch-logs-insights/README.md new file mode 100644 index 0000000..b880b7f --- /dev/null +++ b/tuts/087-aws-xray-to-cloudwatch-logs-insights/README.md @@ -0,0 +1,20 @@ +# AWS X-Ray to CloudWatch Logs Insights + +This tutorial demonstrates how to analyze AWS X-Ray traces and generate CloudWatch Logs Insights queries to find related log entries across your entire AWS infrastructure. You'll learn how to extract trace relationships, visualize service architecture, and automatically create comprehensive log queries for distributed system troubleshooting. 
+ +## Key Features + +The script provides the following capabilities: + +- **Trace Analysis**: Extracts actual timestamps and related trace IDs from X-Ray data +- **Service Mapping**: Visualizes service architecture with performance metrics +- **Query Generation**: Creates CloudWatch Logs Insights queries for all related traces +- **Automatic Execution**: Optionally runs queries and formats results +- **Error Handling**: Comprehensive validation and timeout protection + +## Usage Modes + +- **Query Generation**: `./aws-xray-to-cloudwatch-logs-insights.sh <trace-id> <date>` - Generates CloudWatch Logs Insights query +- **Service Visualization**: `./aws-xray-to-cloudwatch-logs-insights.sh <trace-id> <date> --service-map` - Shows service architecture +- **Automatic Execution**: `./aws-xray-to-cloudwatch-logs-insights.sh <trace-id> <date> --run` - Executes query and displays results +- **Full Analysis**: `./aws-xray-to-cloudwatch-logs-insights.sh <trace-id> <date> --service-map --run` - Complete analysis workflow \ No newline at end of file diff --git a/tuts/087-aws-xray-to-cloudwatch-logs-insights/aws-xray-to-cloudwatch-logs-insights.md b/tuts/087-aws-xray-to-cloudwatch-logs-insights/aws-xray-to-cloudwatch-logs-insights.md new file mode 100644 index 0000000..274dfc0 --- /dev/null +++ b/tuts/087-aws-xray-to-cloudwatch-logs-insights/aws-xray-to-cloudwatch-logs-insights.md @@ -0,0 +1,197 @@ +# X-Ray Trace to CloudWatch Logs Insights + +A shell script that finds all CloudWatch Logs entries associated with an AWS X-Ray trace and all its related traces. + +## Prerequisites + +### Required Tools +- **AWS CLI v2** - Install from [AWS CLI Installation Guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) +- **jq** - JSON processor for parsing AWS responses +- **bash** - Shell environment (macOS/Linux) +- **Standard Unix tools** - `date`, `base64` (usually pre-installed) + +### AWS Configuration +1. **Configure AWS credentials:** + + This tool assumes your AWS CLI is logged in.
For instance, if you're using AWS IAM Identity Center: + + ``` + aws sso login + export AWS_PROFILE=my-profile + ``` + +2. **Required IAM permissions:** + ```json + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "xray:BatchGetTraces", + "xray:GetTraceSummaries", + "xray:GetServiceGraph", + "logs:StartQuery", + "logs:GetQueryResults", + "logs:DescribeLogGroups" + ], + "Resource": "*" + } + ] + } + ``` + +## Usage + +### Basic Syntax +```bash +./aws-xray-to-cloudwatch-logs-insights.sh <trace-id> <date> [--run] [--service-map] +``` + +### Parameters +- **`<trace-id>`** - AWS X-Ray trace ID (format: `1-xxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxx`) +- **`<date>`** - Reference date (required as the second argument, but the script reads the actual trace timestamp from the trace data and does not use this value) +- **`--run`** - Execute the CloudWatch Logs query automatically +- **`--service-map`** - Display service architecture visualization + +## Examples + +### 1. Generate Query Only +```bash +./aws-xray-to-cloudwatch-logs-insights.sh "1-64f2b1c5-8a9e3d7f2b4c6e1a9f8d2c5b" "2024-12-15 14:30:22Z" +``` + +**Output:** +``` +CloudWatch Logs Insights Query: +================================ + +SOURCE logGroups() +| fields @timestamp, @message +| filter @message like /1-64f2b1c5-8a9e3d7f2b4c6e1a9f8d2c5b/ or @message like /1-64f2b1c4-7b8c2e5f3a6d9c1e4b7a8d2f/ +| sort @timestamp desc +| limit 1000 + +Time Range: 2024-12-15 14:25:22 to 2024-12-15 14:35:22 UTC +Related Traces Found: 15 +``` + +### 2.
Show Service Architecture +```bash +./aws-xray-to-cloudwatch-logs-insights.sh "1-64f2b1c5-8a9e3d7f2b4c6e1a9f8d2c5b" "2024-12-15 14:30:22Z" --service-map +``` + +**Output:** +``` +Service Map: +════════════ +┌──────────────────────────────────────────────────────────────────────────────┐ +│ user-api-gateway │ +│ AWS::Lambda │ +│ Requests: 5 │ +│ Avg Time: 1250ms │ +└──────────────────────────────────────────────────────────────────────────────┘ + ↓ +┌──────────────────────────────────────────────────────────────────────────────┐ +│ order-processing-service │ +│ AWS::Lambda::Function │ +│ Requests: 3 │ +│ Avg Time: 850ms │ +└──────────────────────────────────────────────────────────────────────────────┘ + ↓ +┌──────────────────────────────────────────────────────────────────────────────┐ +│ https://sqs.us-east-1.amazonaws.com/123456789012/order-queue │ +│ AWS::SQS::Queue │ +│ Requests: 12 │ +│ Avg Time: 45ms │ +└──────────────────────────────────────────────────────────────────────────────┘ +``` + +### 3. Execute Query Automatically +```bash +./aws-xray-to-cloudwatch-logs-insights.sh "1-64f2b1c5-8a9e3d7f2b4c6e1a9f8d2c5b" "2024-12-15 14:30:22Z" --run +``` + +**Output:** +``` +Query completed. Results: +======================== +2024-12-15 14:30:45.123 | INFO Order processing started for customer 12345 +--- +2024-12-15 14:30:44.856 | DEBUG Validating payment method for order ORD-789 +--- +2024-12-15 14:30:44.234 | ERROR Payment validation failed: insufficient funds +--- +``` + +### 4. Full Analysis Mode +```bash +./aws-xray-to-cloudwatch-logs-insights.sh "1-64f2b1c5-8a9e3d7f2b4c6e1a9f8d2c5b" "2024-12-15 14:30:22Z" --service-map --run +``` + +Shows service architecture, then executes the query and displays formatted results. + +## How It Works + +### 1. Trace Analysis +- Fetches the original trace using `batch-get-traces` +- Extracts actual timestamp from trace data (ignores user-provided date) +- Creates ±5 minute time window around the trace + +### 2. 
Related Trace Discovery +- Analyzes trace links to find parent/child relationships +- Identifies truly related traces (not just concurrent ones) +- Builds comprehensive trace ID list + +### 3. Service Map Generation +- Calls `get-service-graph` for the time window +- Visualizes service architecture with performance metrics +- Shows request counts and average response times + +### 4. Log Query Generation +- Creates CloudWatch Logs Insights query using `SOURCE logGroups()` +- Searches across ALL log groups in your account +- Filters for messages containing any related trace IDs + +## Make it your own + +### Custom Time Windows +The script automatically uses the trace timestamp, but you can modify the time window by editing these lines: +```bash +START_TIME=$((START_TIME - 300)) # 5 minutes before +END_TIME=$((START_TIME + 600)) # 10 minutes total window +``` + +### Filtering Specific Log Groups +To search only specific log groups, modify the SOURCE command: +```bash +SOURCE logGroups(namePrefix: ['/aws/lambda/my-service']) +``` + +## Troubleshooting + +### Common Errors + +**"Invalid trace ID format"** +- Ensure trace ID follows format: `1-xxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxx` +- Check for typos or missing characters + +**"AWS credentials not configured"** +- Run `aws configure` to set up credentials +- Verify with `aws sts get-caller-identity` + +**"Failed to fetch trace data"** +- Check if trace ID exists and is in the correct region +- Verify IAM permissions for X-Ray access +- Ensure trace is not older than 30 days (X-Ray retention limit) + +### Performance Tips + +1. **Use service map first** to understand architecture scope +2. **Check trace age** - older traces may have limited data +3. **Monitor costs** - CloudWatch Logs Insights charges per GB scanned +4. **Filter log groups** for large AWS accounts + +## License + +This tool is provided as-is for educational and operational purposes. Use in accordance with your organization's AWS usage policies. 
diff --git a/tuts/087-aws-xray-to-cloudwatch-logs-insights/aws-xray-to-cloudwatch-logs-insights.sh b/tuts/087-aws-xray-to-cloudwatch-logs-insights/aws-xray-to-cloudwatch-logs-insights.sh new file mode 100755 index 0000000..97f66ce --- /dev/null +++ b/tuts/087-aws-xray-to-cloudwatch-logs-insights/aws-xray-to-cloudwatch-logs-insights.sh @@ -0,0 +1,248 @@ +#!/bin/bash + +# +# X-Ray Trace Log Analysis Tool +# +# Analyzes AWS X-Ray traces and generates CloudWatch Logs Insights queries +# to find related log entries across your entire AWS infrastructure. +# +# Features: +# - Automatic trace timestamp extraction and time window calculation +# - Related trace discovery through parent/child relationships +# - Service architecture visualization with performance metrics +# - CloudWatch Logs Insights query generation and execution +# - Searches across all log groups in your AWS account +# +# Usage: +# ./aws-xray-to-cloudwatch-logs-insights.sh <trace-id> <date> [--run] [--service-map] +# +# Examples: +# ./aws-xray-to-cloudwatch-logs-insights.sh "1-64f2b1c5-8a9e3d7f2b4c6e1a9f8d2c5b" "2024-12-15 14:30:22Z" +# ./aws-xray-to-cloudwatch-logs-insights.sh "1-64f2b1c5-8a9e3d7f2b4c6e1a9f8d2c5b" "2024-12-15 14:30:22Z" --service-map --run +# +# Author: @Paul Santus +# Version: 1.0 +# + +set -euo pipefail # Exit on error, undefined vars, pipe failures + +# Check dependencies +command -v aws >/dev/null 2>&1 || { echo "Error: AWS CLI required but not installed"; exit 1; } +command -v jq >/dev/null 2>&1 || { echo "Error: jq required but not installed"; exit 1; } +command -v date >/dev/null 2>&1 || { echo "Error: date command required"; exit 1; } +command -v base64 >/dev/null 2>&1 || { echo "Error: base64 command required"; exit 1; } + +# Usage: ./aws-xray-to-cloudwatch-logs-insights.sh <trace-id> <date> [--run] [--service-map] +if [ $# -lt 2 ] || [ $# -gt 4 ]; then + echo "Usage: $0 <trace-id> <date> [--run] [--service-map]" + echo "Example: $0 1-68c1a2a4-254e272a518953ead4d8f44a '2025-09-10 16:09:14Z'" + echo " $0 1-68c1a2a4-254e272a518953ead4d8f44a '2025-09-10
16:09:14Z' --run" + echo " $0 1-68c1a2a4-254e272a518953ead4d8f44a '2025-09-10 16:09:14Z' --service-map" + echo " $0 1-68c1a2a4-254e272a518953ead4d8f44a '2025-09-10 16:09:14Z' --run --service-map" + exit 1 +fi + +TRACE_ID="$1" +# Accepted for reference only; the time window is derived from the trace itself +USER_DATE="$2" + +# Input validation +if [[ ! "$TRACE_ID" =~ ^1-[0-9a-f]{8}-[0-9a-f]{24}$ ]]; then + echo "Error: Invalid trace ID format. Expected format: 1-xxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxx" + exit 1 +fi +RUN_QUERY="" +SHOW_SERVICE_MAP="" + +# Parse flags +for arg in "$@"; do + case $arg in + --run) + RUN_QUERY="--run" + ;; + --service-map) + SHOW_SERVICE_MAP="--service-map" + ;; + esac +done + +# Get AWS region from CLI profile +# '|| true' keeps 'set -e' from aborting before the error message below when no region is set +AWS_REGION=$(aws configure get region || true) +if [ -z "$AWS_REGION" ]; then + echo "Error: AWS region not configured. Run 'aws configure' first." + exit 1 +fi + +# Verify AWS credentials +if ! aws sts get-caller-identity >/dev/null 2>&1; then + echo "Error: AWS credentials not configured or invalid. Run 'aws configure' first." + exit 1 +fi + +# Get trace details to extract actual timestamp +echo "Fetching trace details for: $TRACE_ID" +if ! TRACE_DATA=$(aws xray batch-get-traces --region "$AWS_REGION" --trace-ids "$TRACE_ID" --query 'Traces[0]' --output json 2>/dev/null); then + echo "Error: Failed to fetch trace data. Check AWS permissions and trace ID."
+ exit 1 +fi + +if [ "$TRACE_DATA" == "null" ]; then + echo "Error: Trace ID not found" + exit 1 +fi + +# Extract start time from trace data and create time range (±5 minutes) +START_TIME=$(echo "$TRACE_DATA" | jq -r '.Segments[0].Document' | jq -r '.start_time') +START_TIME=${START_TIME%.*} # Remove decimal part +START_TIME=$((START_TIME - 300)) +END_TIME=$((START_TIME + 600)) + +# Convert to date format for display +if [[ "$OSTYPE" == "darwin"* ]]; then + # macOS date command + START_DATE=$(date -u -r "$START_TIME" '+%Y-%m-%d %H:%M:%S') + END_DATE=$(date -u -r "$END_TIME" '+%Y-%m-%d %H:%M:%S') +else + # Linux date command + START_DATE=$(date -u -d "@$START_TIME" '+%Y-%m-%d %H:%M:%S') + END_DATE=$(date -u -d "@$END_TIME" '+%Y-%m-%d %H:%M:%S') +fi + +# Get service map for the time range to find related services +echo "Using time range: $START_DATE to $END_DATE UTC" +echo "Fetching service map..." +SERVICE_MAP=$(aws xray get-service-graph --region "$AWS_REGION" --start-time "$START_TIME" --end-time "$END_TIME" --query 'Services[].Name' --output text) + +# Display service map visualization if requested +if [ "$SHOW_SERVICE_MAP" = "--service-map" ]; then + echo "" + echo "Service Map:" + echo "════════════" + aws xray get-service-graph --region "$AWS_REGION" --start-time "$START_TIME" --end-time "$END_TIME" --output json | \ + jq -r ' + .Services[] | + select(.Type != "client") | + (.Name | if length > 74 then .[0:71] + "..." else . end) as $shortName | + (.SummaryStatistics.TotalCount | tostring) as $requests | + ((.SummaryStatistics.TotalResponseTime / .SummaryStatistics.TotalCount * 1000 | floor) | tostring) + "ms" as $avgTime | + "┌──────────────────────────────────────────────────────────────────────────────┐\n" + + "│ " + ($shortName + (" " * (76 - ($shortName | length)))) + " │\n" + + "│ " + (.Type | if length > 76 then .[0:73] + "..." else . + (" " * (76 - (. 
| length))) end) + " │\n" + + "│ Requests: " + $requests + (" " * (67 - ($requests | length))) + "│\n" + + "│ Avg Time: " + $avgTime + (" " * (67 - ($avgTime | length))) + "│\n" + + "└──────────────────────────────────────────────────────────────────────────────┘\n" + + " ↓" + ' + echo "" +fi + +# Get all traces in the time window +echo "Fetching related traces..." +RELATED_TRACES=$(aws xray get-trace-summaries --region "$AWS_REGION" \ + --start-time "$START_TIME" \ + --end-time "$END_TIME" \ + --query 'TraceSummaries[].Id' \ + --output text) + +# Extract linked trace IDs from the original trace +echo "Extracting linked traces..." +LINKED_TRACES=$(echo "$TRACE_DATA" | jq -r '.Segments[].Document' | jq -r 'select(.links != null) | .links[].trace_id' 2>/dev/null | sort -u) + +# Combine original trace ID with linked traces +ALL_RELATED_TRACES="$TRACE_ID" +if [ -n "$LINKED_TRACES" ]; then + ALL_RELATED_TRACES="$ALL_RELATED_TRACES $LINKED_TRACES" +fi + +# Convert trace IDs to array and create filter (always include original trace ID) +TRACE_ARRAY=($ALL_RELATED_TRACES) +TRACE_FILTER="@message like /$TRACE_ID/" + +for trace in $LINKED_TRACES; do + if [ "$trace" != "$TRACE_ID" ]; then + TRACE_FILTER="$TRACE_FILTER or @message like /$trace/" + fi +done + +# Generate CloudWatch Logs Insights query +QUERY_STRING="SOURCE logGroups() +| fields @timestamp, @message +| filter $TRACE_FILTER +| sort @timestamp desc +| limit 1000" + +if [ "$RUN_QUERY" = "--run" ]; then + echo "Running CloudWatch Logs Insights query across all log groups..." + + if ! QUERY_ID=$(aws logs start-query \ + --region "$AWS_REGION" \ + --start-time "$START_TIME" \ + --end-time "$END_TIME" \ + --query-string "$QUERY_STRING" \ + --query 'queryId' \ + --output text 2>/dev/null); then + echo "Error: Failed to start CloudWatch Logs query. Check permissions." + exit 1 + fi + + echo "Query started with ID: $QUERY_ID" + echo "Waiting for query to complete..." 
+ + # Wait for query to complete and fetch results with timeout + TIMEOUT=300 # 5 minutes + ELAPSED=0 + while [ $ELAPSED -lt $TIMEOUT ]; do + if ! STATUS=$(aws logs get-query-results --region "$AWS_REGION" --query-id "$QUERY_ID" --query 'status' --output text 2>/dev/null); then + echo "Error: Failed to check query status" + exit 1 + fi + + if [ "$STATUS" = "Complete" ]; then + echo "Query completed. Results:" + echo "========================" + aws logs get-query-results --region "$AWS_REGION" --query-id "$QUERY_ID" --query 'results' --output json 2>/dev/null | \ + jq -r '.[] | @base64' | while read -r line; do + echo "$line" | base64 -d | jq -r ' + . as $fields | + ($fields[] | select(.field == "@timestamp") | .value) as $timestamp | + ($fields[] | select(.field == "@message") | .value) as $message | + "\($timestamp) | \($message)" + ' + echo "---" + done + break + elif [ "$STATUS" = "Failed" ] || [ "$STATUS" = "Cancelled" ] || [ "$STATUS" = "Timeout" ]; then + # Stop on any terminal non-success status instead of polling until the local timeout + echo "Error: Query ended with status: $STATUS" + exit 1 + else + sleep 2 + ELAPSED=$((ELAPSED + 2)) + fi + done + + if [ $ELAPSED -ge $TIMEOUT ]; then + echo "Error: Query timed out after $TIMEOUT seconds" + exit 1 + fi +else + cat << EOF + +CloudWatch Logs Insights Query: +================================ + +SOURCE logGroups() +| fields @timestamp, @message +| filter $TRACE_FILTER +| sort @timestamp desc +| limit 1000 + +Time Range: $START_DATE to $END_DATE UTC +Related Traces Found: ${#TRACE_ARRAY[@]} + +To run this query: +aws logs start-query \\ + --start-time $START_TIME \\ + --end-time $END_TIME \\ + --query-string '$QUERY_STRING' + +EOF +fi \ No newline at end of file
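
The time-window arithmetic and the filter-building loop in the script above can be exercised without AWS access. The following is a minimal sketch and not part of the patch: the function names `time_window` and `build_filter` are hypothetical, and it assumes the trace start time arrives as epoch seconds with an optional fractional part, as the script's `${START_TIME%.*}` step suggests.

```bash
#!/usr/bin/env bash
# Standalone sketch; time_window and build_filter are hypothetical helpers
# that mirror the script's logic and do not exist in the patch.
set -euo pipefail

# Print "START END": a window from 5 minutes before the trace start
# to 5 minutes after it (10 minutes in total).
time_window() {
  local start="${1%.*}"        # drop the fractional part, if any
  local from=$((start - 300))  # 5 minutes before the trace start
  local to=$((from + 600))     # 10 minutes after the window start
  echo "$from $to"
}

# Print a Logs Insights filter expression matching the first trace ID
# or any of the remaining IDs, skipping repeats of the first one.
build_filter() {
  local first="$1"
  shift
  local filter="@message like /$first/"
  local id
  for id in "$@"; do
    if [ "$id" != "$first" ]; then
      filter="$filter or @message like /$id/"
    fi
  done
  echo "$filter"
}

time_window "1700000000.5"
build_filter "1-64f2b1c5-8a9e3d7f2b4c6e1a9f8d2c5b" "1-64f2b1c4-7b8c2e5f3a6d9c1e4b7a8d2f"
```

Splitting the logic into functions like this is only one way to make the window and filter checkable in isolation; the patch itself keeps both inline.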