40 changes: 40 additions & 0 deletions .github/workflows/marketplace-consistency.yml
@@ -0,0 +1,40 @@
# Eventual-consistency workaround for GitHub Actions reliability issues on
# auto-update-marketplace PRs. Not a replacement for the normal publish flow —
# only unsticks PRs whose workflow runs failed to start or got hung up by
# GH-side flake. See agentic-marketplace/heal-stuck-prs/README.md.
name: Marketplace Consistency

on:
  workflow_call:
    inputs:
      branch:
        description: 'Branch used by publish/action.yml for auto-generated PRs.'
        default: 'auto-update-marketplace'
        type: string
      label:
        description: 'Label publish/action.yml sets on auto-generated PRs.'
        default: 'automated'
        type: string
      stuck-threshold-seconds:
        description: 'Age threshold (seconds) above which a PR/job is stuck.'
        default: '90'
        type: string
    secrets:
      token:
        description: 'GITHUB_TOKEN is sufficient. Needs contents:write, pull-requests:write, actions:write.'
        required: true

jobs:
  heal:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
      actions: write
    steps:
      - uses: bitcomplete/bc-github-actions/agentic-marketplace/heal-stuck-prs@v1
        with:
          github-token: ${{ secrets.token }}
          branch: ${{ inputs.branch }}
          label: ${{ inputs.label }}
          stuck-threshold-seconds: ${{ inputs.stuck-threshold-seconds }}
48 changes: 26 additions & 22 deletions agentic-marketplace/README.md
@@ -167,32 +167,28 @@ Generates marketplace.json and plugin.json files, then creates a pull request wi
The marketplace actions are configured via a TOML file at `.claude-plugin/generator.config.toml`:

```toml
-# Naming pattern for components
-naming_pattern = "^[a-z0-9]+(-[a-z0-9]+)*$" # kebab-case
+[discovery]
+# Directories the scanner will not enter.
+excludeDirs = [".git", "node_modules", ".github", ".claude", "templates"]

-# Reserved words that cannot appear in component names
-reserved_words = ["anthropic", "claude"]
+# Glob patterns to skip (case-insensitive).
+excludePatterns = ["**/template/**", "OPENCODE.md", "CLAUDE.md"]

-# Plugin discovery paths (glob patterns)
-plugin_categories = ["code/**", "analysis/**", "communication/**"]
+# Skill definition filename.
+skillFilename = "SKILL.md"

-# Component types to discover
-[discovery]
-plugins = true
-commands = true
-skills = true
-agents = true
-hooks = true
-mcp_servers = true
-
-# Validation rules
[validation]
-require_description = true
-require_version = true
-min_description_length = 10
-max_description_length = 200
+# Kebab-case by default.
+namePattern = "^[a-z0-9]+(-[a-z0-9]+)*$"
+reservedWords = ["anthropic", "claude"]
+nameMaxLength = 64
+descriptionMaxLength = 1024
```

Discovery is excludes-only: components are any `<category>/<plugin-name>/` directory that isn't excluded. There is no include list. Any top-level directory that contains components becomes a category.

> **Deprecated:** older configs set `pluginCategories = ["code/**", ...]` as an include list. The field is still accepted but has no effect — discovery now uses `excludeDirs` / `excludePatterns` only. Remove it when you touch the config.
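
The `[validation]` name rules can be approximated in a few lines of bash. This is only an illustrative sketch, not the validate action's implementation: `validate_name` is a hypothetical helper, and it assumes a reserved word is rejected wherever it appears inside a name.

```shell
#!/usr/bin/env bash
# Values copied from the [validation] table in generator.config.toml.
NAME_PATTERN='^[a-z0-9]+(-[a-z0-9]+)*$'
RESERVED_WORDS=(anthropic claude)
NAME_MAX_LENGTH=64

# Hypothetical helper: prints "ok" and returns 0 when the name passes
# all three checks; prints a reason and returns 1 otherwise.
# Assumption: a reserved word is rejected anywhere inside the name.
validate_name() {
  local name=$1 word
  if ! [[ $name =~ $NAME_PATTERN ]]; then
    echo "invalid: does not match namePattern"
    return 1
  fi
  if [ "${#name}" -gt "$NAME_MAX_LENGTH" ]; then
    echo "invalid: longer than nameMaxLength"
    return 1
  fi
  for word in "${RESERVED_WORDS[@]}"; do
    if [[ $name == *"$word"* ]]; then
      echo "invalid: contains reserved word '$word'"
      return 1
    fi
  done
  echo "ok"
}
```

For example, `validate_name code-review` passes, while `validate_name claude-helper` is rejected for the reserved word and `validate_name my_plugin` is rejected by the pattern.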

## Repository Structure

The marketplace action expects this structure:
@@ -224,7 +220,7 @@ your-marketplace/

### Discovery Process

-The discover action scans your repository based on the plugin_categories patterns in your config:
+The discover action walks your repository from root and gates only on `excludeDirs` / `excludePatterns`:

1. Finds all directories matching the two-level pattern: `category/plugin-name/`
2. Scans each plugin directory for:
@@ -236,6 +232,8 @@ The discover action scans your repository based on the plugin_categories pattern
3. Extracts metadata from YAML frontmatter in markdown files
4. Outputs discovered components as JSON
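
Step 1 of the process above can be sketched in bash. This is a rough approximation under stated assumptions, not the discover action's code: `discover_plugins` is a hypothetical helper, `EXCLUDE_DIRS` mirrors the `excludeDirs` default shown earlier, and `excludePatterns` is not handled.

```shell
#!/usr/bin/env bash
# Mirrors the excludeDirs default from generator.config.toml.
EXCLUDE_DIRS=(.git node_modules .github .claude templates)

# Hypothetical helper: prints every category/plugin-name directory under
# the given root, skipping any path whose category or plugin directory
# name is listed in EXCLUDE_DIRS. excludePatterns is not modelled here.
discover_plugins() {
  local root=${1%/} path rel category plugin ex skip
  for path in "$root"/*/*/; do
    [ -d "$path" ] || continue        # the glob matched nothing
    rel=${path#"$root"/}
    rel=${rel%/}
    category=${rel%%/*}
    plugin=${rel#*/}
    skip=0
    for ex in "${EXCLUDE_DIRS[@]}"; do
      if [ "$category" = "$ex" ] || [ "$plugin" = "$ex" ]; then
        skip=1
      fi
    done
    if [ "$skip" -eq 0 ]; then
      echo "$rel"
    fi
  done
}

# Usage: discover_plugins /path/to/your-marketplace
```

Because the walk has no include list, any new top-level directory with a plugin directory inside it shows up in the output without a config change.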

Validate and generate consume the exact same discovery output. A component that passes validate will appear in generate's marketplace.json by construction. Components found outside a `category/plugin-name/` path (e.g. at repo root) are orphans and fail both stages.

### Validation Process

The validate action checks each component against your configuration rules:
@@ -263,8 +261,8 @@ The generate action creates or updates marketplace files:
### No components discovered

Check that:
-- Your `generator.config.toml` `plugin_categories` patterns match your directory structure
- Plugin directories follow the two-level pattern: `category/plugin-name/`
+- The plugin's directory isn't covered by `excludeDirs` / `excludePatterns`
- Component files have proper YAML frontmatter

### Validation failures
@@ -334,6 +332,12 @@ Use validation output to implement custom logic:
echo '${{ steps.validate.outputs.errors }}'
```

## Reliability

### heal-stuck-prs

Works around GitHub Actions infrastructure flakiness on `auto-update-marketplace` PRs. It scans for PRs that have no checks attached, jobs stuck in the `queued` state, or stalled auto-merge, and applies a targeted recovery to each. It runs on a cron schedule; see [`heal-stuck-prs/README.md`](heal-stuck-prs/README.md). This is not a replacement for the normal publish flow: only reach for it when GitHub-side flakiness has left a PR hanging.

## Examples

See the [main README](../README.md) for complete workflow examples and diagrams.
73 changes: 73 additions & 0 deletions agentic-marketplace/heal-stuck-prs/README.md
@@ -0,0 +1,73 @@
# Heal Stuck Marketplace PRs

**This action exists to work around GitHub Actions reliability issues.** It is not a general automation layer, does not replace the normal `agentic-marketplace` publish flow, and should not be reached for when the normal workflow could already handle the situation.

## What it does

Scans the repo it runs in for open PRs on the `auto-update-marketplace` branch (the one [`agentic-marketplace/publish`](../publish/action.yml) creates) that have gotten stuck due to GH-side flake, and applies a narrowly targeted recovery:

| Classification | Symptom | Remediation |
|---|---|---|
| **No workflow runs** | PR has no runs attached to its head SHA and is older than the threshold | Push an empty commit to the PR branch to retrigger `pull_request` workflows; re-issue auto-merge |
| **Queued jobs** | Any job older than the threshold with `started_at = null` | Cancel the stuck run, re-run it, re-issue auto-merge |
| **Stalled auto-merge** | All checks green but `autoMergeRequest` is unset | Re-issue `gh pr merge --auto --squash` |
| **Healthy** | Everything in the expected envelope | Log, skip |
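
The table above can be read as an ordered decision: the first matching row wins. The sketch below restates that order in bash; `classify_pr` and its label strings are hypothetical names used for illustration, and the real action also performs the remediation in each branch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps the observed state of one PR to a table row.
# Arguments:
#   $1 run_count   workflow runs attached to the PR head SHA
#   $2 pr_age      seconds since the PR was created
#   $3 stuck_jobs  jobs not started and older than the threshold
#   $4 auto_merge  "null" when autoMergeRequest is unset
#   $5 threshold   stuck-threshold-seconds
classify_pr() {
  local run_count=$1 pr_age=$2 stuck_jobs=$3 auto_merge=$4 threshold=$5

  if [ "$run_count" -eq 0 ] && [ "$pr_age" -gt "$threshold" ]; then
    echo "no-workflow-runs"
  elif [ "$stuck_jobs" -gt 0 ]; then
    echo "queued-jobs"
  elif [ "$auto_merge" = "null" ]; then
    echo "stalled-auto-merge"
  else
    echo "healthy"
  fi
}

classify_pr 0 120 0 null 90     # prints no-workflow-runs
classify_pr 2 300 1 set 90      # prints queued-jobs
classify_pr 2 300 0 set 90      # prints healthy
```

Checking for missing runs first matters: a PR with no runs cannot have queued jobs, and re-issuing auto-merge alone would not attach checks to it.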

## When it runs

Intended to be invoked on a cron schedule (every 30 min is a sensible default) via the [`marketplace-consistency.yml`](../../.github/workflows/marketplace-consistency.yml) reusable workflow. Runs are lightweight and no-op on healthy repos.

## Stuck threshold

Defaults to **90 seconds**. The value was derived by sampling `started_at − created_at` (runner-wait time) across a spread of recent healthy `Update Agentic Marketplace` runs on a reference repo, where the maximum observed wait was 9s. Multiplying that ceiling by 10 gives 90s, a threshold comfortably outside the healthy operating envelope.

Override via `stuck-threshold-seconds` if your repo's workflows have different queue-time characteristics.
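
If you want to derive a threshold for your own repo the same way, the arithmetic can be sketched as follows. `derive_threshold` is a hypothetical helper, the timestamps are made-up sample data rather than real measurements, and the script assumes GNU `date` (as found on `ubuntu-latest`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: takes "created_at started_at" pairs of ISO-8601
# timestamps, finds the largest runner-wait time in seconds, and prints
# ten times that value as a candidate stuck threshold.
# Assumes GNU date for the -d option.
derive_threshold() {
  local max_wait=0 pair created started delta
  for pair in "$@"; do
    read -r created started <<<"$pair"
    delta=$(( $(date -u -d "$started" +%s) - $(date -u -d "$created" +%s) ))
    if [ "$delta" -gt "$max_wait" ]; then
      max_wait=$delta
    fi
  done
  echo $(( max_wait * 10 ))
}

# Made-up sample pairs with waits of 4s and 9s.
derive_threshold \
  "2024-01-01T00:00:00Z 2024-01-01T00:00:04Z" \
  "2024-01-01T01:00:00Z 2024-01-01T01:00:09Z"   # prints 90
```

The `created_at` and `started_at` values for real jobs are available from the same jobs endpoint the action queries (`repos/<owner>/<repo>/actions/runs/<run_id>/jobs`).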

## Inputs

| Name | Required | Default | Description |
|---|---|---|---|
| `github-token` | yes | — | `GITHUB_TOKEN` is sufficient. Needs `contents:write`, `pull-requests:write`, `actions:write`. |
| `branch` | no | `auto-update-marketplace` | Branch that `publish/action.yml` uses for auto-generated PRs. |
| `label` | no | `automated` | Label that `publish/action.yml` puts on auto-generated PRs. |
| `stuck-threshold-seconds` | no | `90` | Age above which a PR/job is considered stuck. |

## Usage

Direct:

```yaml
- uses: bitcomplete/bc-github-actions/agentic-marketplace/heal-stuck-prs@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
```

Via the reusable workflow (recommended):

```yaml
name: Marketplace Consistency Cron
on:
  schedule:
    - cron: '*/30 * * * *'
  workflow_dispatch:

permissions:
  contents: write
  pull-requests: write
  actions: write

jobs:
  heal:
    uses: bitcomplete/bc-github-actions/.github/workflows/marketplace-consistency.yml@v1
    secrets:
      token: ${{ secrets.GITHUB_TOKEN }}
```

## Why this exists

When GitHub Actions infrastructure is healthy, `agentic-marketplace/publish` opens the auto-update PR, the PR's `pull_request` workflow picks up, checks go green, auto-merge fires — all in under a minute. When infrastructure is flaky, one of two things happens:

1. The `pull_request` workflow never triggers, so the PR has zero checks and auto-merge can't evaluate. Previously this left the PR blocked until someone noticed and closed it manually (see historical PRs #39, #40).
2. A job enters `queued` state and never gets a runner. The PR sits indefinitely waiting for checks that will never complete.

This action catches both cases after the fact, without replacing any of the normal flow.
119 changes: 119 additions & 0 deletions agentic-marketplace/heal-stuck-prs/action.yml
@@ -0,0 +1,119 @@
name: 'Heal Stuck Marketplace PRs'
description: 'Eventual-consistency workaround for GitHub Actions reliability flake on auto-update-marketplace PRs. Not a general automation layer.'
author: 'Bitcomplete'

inputs:
  github-token:
    description: 'GitHub token (GITHUB_TOKEN is fine — needs contents:write, pull-requests:write, actions:write).'
    required: true
  branch:
    description: 'Branch name used by publish/action.yml for auto-generated PRs.'
    required: false
    default: 'auto-update-marketplace'
  label:
    description: 'Label used by publish/action.yml to mark auto-generated PRs.'
    required: false
    default: 'automated'
  stuck-threshold-seconds:
    description: 'A PR or job older than this (seconds) with no check runs, or with jobs still in queued/pending state, is considered stuck. 90s = 10x the observed runner-wait ceiling in healthy runs.'
    required: false
    default: '90'

runs:
  using: 'composite'
  steps:
    - name: Heal stuck PRs
      shell: bash
      env:
        GH_TOKEN: ${{ inputs.github-token }}
        HEAL_BRANCH: ${{ inputs.branch }}
        HEAL_LABEL: ${{ inputs.label }}
        HEAL_THRESHOLD: ${{ inputs.stuck-threshold-seconds }}
      run: |
        set -euo pipefail

        REPO="${GITHUB_REPOSITORY}"
        NOW=$(date -u +%s)

        echo "heal-stuck-prs: scanning $REPO for open PRs on branch=$HEAL_BRANCH label=$HEAL_LABEL (threshold=${HEAL_THRESHOLD}s)"

        # Enumerate candidate PRs: open, on our auto-update branch, with our label.
        PRS=$(gh pr list \
          --repo "$REPO" \
          --state open \
          --json number,createdAt,headRefName,headRefOid,labels,autoMergeRequest \
          --jq "[.[] | select(.headRefName == \"$HEAL_BRANCH\" and ((.labels // []) | map(.name) | index(\"$HEAL_LABEL\")))]")

        COUNT=$(echo "$PRS" | jq 'length')
        if [ "$COUNT" = "0" ]; then
          echo "heal-stuck-prs: no candidate PRs — nothing to heal"
          exit 0
        fi

        echo "heal-stuck-prs: $COUNT candidate PR(s) to classify"

        echo "$PRS" | jq -c '.[]' | while read -r PR; do
          PR_NUM=$(echo "$PR" | jq -r '.number')
          PR_SHA=$(echo "$PR" | jq -r '.headRefOid')
          PR_CREATED=$(echo "$PR" | jq -r '.createdAt')
          PR_AUTO_MERGE=$(echo "$PR" | jq -r '.autoMergeRequest // "null"')
          PR_AGE=$(( NOW - $(date -u -d "$PR_CREATED" +%s) ))

          # Jobs for all workflow runs associated with this head SHA.
          RUNS=$(gh api "repos/$REPO/actions/runs?head_sha=$PR_SHA&per_page=100" \
            --jq '[.workflow_runs[] | {id, status, conclusion}]' 2>/dev/null || echo '[]')
          RUN_COUNT=$(echo "$RUNS" | jq 'length')

          # Aggregate job state: count stuck-in-queue jobs across all runs.
          STUCK_JOBS=0
          STUCK_RUN_ID=""
          if [ "$RUN_COUNT" -gt 0 ]; then
            for RUN_ID in $(echo "$RUNS" | jq -r '.[].id'); do
              JOBS=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs" --jq '[.jobs[] | {status, created_at, started_at}]' 2>/dev/null || echo '[]')
              N_STUCK=$(echo "$JOBS" | jq --argjson threshold "$HEAL_THRESHOLD" --argjson now "$NOW" '
                [.[] | select(.started_at == null and .status != "completed")
                 | select(($now - (.created_at | fromdateiso8601)) > $threshold)] | length')
              if [ "$N_STUCK" -gt 0 ]; then
                STUCK_JOBS=$(( STUCK_JOBS + N_STUCK ))
                STUCK_RUN_ID="$RUN_ID"
              fi
            done
          fi

          # Classify and act.
          if [ "$RUN_COUNT" = "0" ] && [ "$PR_AGE" -gt "$HEAL_THRESHOLD" ]; then
            echo "heal-stuck-prs PR #$PR_NUM: [stuck — no workflow runs for head SHA, age=${PR_AGE}s] pushing empty commit to retrigger"
            TMPDIR=$(mktemp -d)
            git clone --depth 1 --branch "$HEAL_BRANCH" "https://x-access-token:$GH_TOKEN@github.com/$REPO.git" "$TMPDIR/repo"
            pushd "$TMPDIR/repo" >/dev/null
            git -c user.email="github-actions[bot]@users.noreply.github.com" -c user.name="github-actions[bot]" \
              commit --allow-empty -m "chore: retrigger workflows (heal-stuck-prs)"
            git push origin "$HEAL_BRANCH"
            popd >/dev/null
            rm -rf "$TMPDIR"
            gh pr merge "$PR_NUM" --repo "$REPO" --auto --squash || echo "heal-stuck-prs PR #$PR_NUM: auto-merge re-enable deferred (checks not attached yet)"
            continue
          fi

          if [ "$STUCK_JOBS" -gt 0 ]; then
            echo "heal-stuck-prs PR #$PR_NUM: [stuck — $STUCK_JOBS job(s) queued >${HEAL_THRESHOLD}s] cancelling run $STUCK_RUN_ID and re-running"
            gh run cancel "$STUCK_RUN_ID" --repo "$REPO" || true
            gh run rerun "$STUCK_RUN_ID" --repo "$REPO" --failed || gh run rerun "$STUCK_RUN_ID" --repo "$REPO" || true
            gh pr merge "$PR_NUM" --repo "$REPO" --auto --squash || true
            continue
          fi

          if [ "$PR_AUTO_MERGE" = "null" ]; then
            echo "heal-stuck-prs PR #$PR_NUM: [stalled — auto-merge not set] re-enabling"
            gh pr merge "$PR_NUM" --repo "$REPO" --auto --squash || echo "heal-stuck-prs PR #$PR_NUM: could not enable auto-merge"
            continue
          fi

          echo "heal-stuck-prs PR #$PR_NUM: [healthy] skip (age=${PR_AGE}s, runs=$RUN_COUNT)"
        done

        echo "heal-stuck-prs: done"

branding:
  icon: 'activity'
  color: 'yellow'