Re-running individual failed test jobs fails because shared infrastructure has been torn down #590

Maintainers and contributors run the [`Process-PSModule.yml`](https://github.com/PSModule/GitHub/blob/main/.github/workflows/Process-PSModule.yml) workflow on every pull request. The workflow consists of a `BeforeAll-ModuleLocal` job that provisions shared GitHub infrastructure (organization and user repositories named `Test-{OS}-{TokenType}-{RunID}`), a fan-out matrix of `Test-ModuleLocal` jobs (one per test file × OS × auth case), and an `AfterAll-ModuleLocal` job that tears everything down. The setup and teardown were introduced in [#541](https://github.com/PSModule/GitHub/pull/541) so that the matrix could share repositories instead of each leaf job creating its own.

## Request

When a small number of leaf test jobs in the `Test-ModuleLocal` matrix fail because of intermittent issues (rate limits, transient API errors, GitHub-side flakiness), the only recovery paths in the GitHub Actions UI are **Re-run failed jobs** and **Re-run all jobs**. Neither gives a usable targeted retry for this workflow: **Re-run failed jobs** fails again immediately, and **Re-run all jobs** repeats the entire matrix.

### What happens

1. The matrix completes with a mix of successful and failed leaf jobs (the screenshot on the originating discussion shows `Test-Releases`, `Test-Repositories`, and `Test-Variables` red while ~25 sibling jobs are green).
2. `AfterAll-ModuleLocal` is configured with `if: ... && always()` and runs as soon as `Test-ModuleLocal` finishes, regardless of the matrix outcome. It deletes every `Test-{OS}-{TokenType}-{RunID}*` repository for the run.
3. The user clicks **Re-run failed jobs**.
4. GitHub Actions reruns only the failed leaf jobs in `Test-ModuleLocal`. It does **not** rerun `BeforeAll-ModuleLocal` because that job succeeded the first time, and skipped/successful `needs` are not re-executed by **Re-run failed jobs**.
5. The rerun leaf jobs immediately fail in their per-context `BeforeAll`, where they call `Get-GitHubRepository -Name "Test-{OS}-{TokenType}-{RunID}"` against a repository that no longer exists.
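
The sequence above can be reproduced with a toy in-memory model. This is only a sketch of the ordering problem, not the workflow's actual code: the function names, the `Linux-TokenA` and `macOS-TokenA` context values, and the run ID `1001` are all hypothetical, and a Python `set` stands in for the repositories on GitHub.

```python
def repo_name(context, run_id):
    # Mirrors the Test-{OS}-{TokenType}-{RunID} pattern; context is "{OS}-{TokenType}".
    return f"Test-{context}-{run_id}"

def before_all(repos, contexts, run_id):
    # Stand-in for BeforeAll-ModuleLocal: provision one shared repository per context.
    for context in contexts:
        repos.add(repo_name(context, run_id))

def leaf_job(repos, context, run_id, flaky=False):
    # Stand-in for one Test-ModuleLocal leaf: resolve the shared repository, then test.
    if repo_name(context, run_id) not in repos:
        return "failed: repository not found"
    return "failed: transient error" if flaky else "passed"

def after_all(repos, run_id):
    # Stand-in for AfterAll-ModuleLocal: delete every repository that belongs to the run.
    # (Simplified: matches on the run ID suffix rather than a wildcard prefix.)
    for name in list(repos):
        if name.startswith("Test-") and name.endswith(f"-{run_id}"):
            repos.discard(name)

def simulate(run_id="1001"):
    repos = set()
    contexts = ["Linux-TokenA", "macOS-TokenA"]  # hypothetical OS and token-type values
    before_all(repos, contexts, run_id)
    first = {c: leaf_job(repos, c, run_id, flaky=(c == "macOS-TokenA")) for c in contexts}
    after_all(repos, run_id)  # runs regardless of the matrix outcome
    # "Re-run failed jobs": only failed leaves run again; before_all is not repeated.
    rerun = {c: leaf_job(repos, c, run_id) for c, r in first.items() if r != "passed"}
    return first, rerun

first, rerun = simulate()
print(first)
print(rerun)
```

In this model the rerun leaf no longer hits the transient error, yet it still fails because the repository lookup comes up empty, which matches step 5.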

### What is expected

A user who wants to retry transiently failing leaf jobs can do so without having to rerun every successful sibling job. The shared infrastructure that the leaf jobs depend on must be present (or re-created) before the rerun executes, so the targeted retry succeeds when the underlying flake has cleared.

### Reproduction steps

1. Open a pull request against `main` so that `Process-PSModule.yml` runs end-to-end.
2. Wait for the workflow to finish with at least one failed leaf job in the `Test-ModuleLocal` matrix and the `AfterAll-ModuleLocal` job completed (green).
3. From the workflow run page, click **Re-run failed jobs**.
4. Observe that every rerun leaf job fails in its `BeforeAll` block while resolving the shared repository, because `AfterAll-ModuleLocal` deleted that repository at the end of step 2.

### Regression

This behavior was introduced by [#541](https://github.com/PSModule/GitHub/pull/541). Before that change, every leaf test job created and deleted its own repositories using `[guid]::NewGuid()`, so reruns were self-contained — each leaf job rebuilt the infrastructure it needed. Consolidating to shared run-scoped infrastructure removed that property.
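
For contrast, the earlier self-contained pattern can be sketched as follows. This is an illustrative model only: `uuid.uuid4()` plays the role of `[guid]::NewGuid()`, the `Test-<guid>` name format is an assumption (the source does not state how the pre-#541 repositories were named), and a `set` again stands in for GitHub.

```python
import uuid

def self_contained_leaf(repos):
    # Each leaf creates a uniquely named repository, uses it, and removes it again.
    name = f"Test-{uuid.uuid4()}"  # hypothetical name format
    repos.add(name)
    try:
        # The "test": the repository this leaf just created must be resolvable.
        return "passed" if name in repos else "failed"
    finally:
        repos.discard(name)  # teardown belongs to the leaf itself

repos = set()
print(self_contained_leaf(repos))  # original attempt
print(self_contained_leaf(repos))  # a rerun needs nothing left over from the first attempt
```

Because setup and teardown live inside the leaf, a rerun does not depend on the state left behind by any other job. The cost is one create and one delete per leaf, which is the overhead the shared design set out to avoid.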

### Workaround

The only working recovery today is **Re-run all jobs**, which re-executes every successful leaf job. For a workflow where a single matrix leg can take well over an hour (`Test-Apps (macOS)` ran 1h 40m in the linked screenshot), this wastes substantial CI time and consumes additional GitHub API quota that the shared-repo design in [#541](https://github.com/PSModule/GitHub/pull/541) was specifically introduced to conserve.

### Acceptance criteria

- A user can trigger a retry of a specific failed leaf job (or a subset of failed leaf jobs) and have it succeed when the underlying transient cause has cleared, without rerunning every successful sibling.
- The shared infrastructure the leaf jobs depend on is guaranteed to exist when those jobs execute, regardless of whether they are part of the original run or a partial rerun.
- Successful leaf jobs from the original run are not re-executed unnecessarily on a partial retry.
- Concurrent workflow runs continue to be isolated from each other (the run-scoped naming guarantee from [#541](https://github.com/PSModule/GitHub/pull/541) is preserved).
- API call budget for a partial retry is bounded — a retry of one leaf job does not cost more than the leaf job itself plus whatever minimum setup is unavoidable.
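
To make the second and last criteria concrete, here is one possible shape for a get-or-create check, offered as an illustration rather than a proposed design. The `ensure_repo` helper and the `calls` counter are hypothetical, and the model counts one lookup per job plus one create only when the repository is missing.

```python
def ensure_repo(repos, name, calls):
    # Return the repository name, creating the repository only when it is missing.
    calls["get"] += 1  # one lookup is always needed
    if name not in repos:
        calls["create"] += 1  # extra cost is paid only when the repository is gone
        repos.add(name)
    return name

repos = set()
calls = {"get": 0, "create": 0}
ensure_repo(repos, "Test-Linux-TokenA-1001", calls)  # repository missing: created
ensure_repo(repos, "Test-Linux-TokenA-1001", calls)  # repository present: lookup only
print(calls)
```

Under these assumptions the extra cost of a partial retry is at most one create call per missing repository, and a job that finds its repository in place pays only the lookup.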

