Re-running individual failed test jobs fails because shared infrastructure has been torn down #590

@MariusStorhaug

Description

Maintainers and contributors run the Process-PSModule.yml workflow on every pull request. The workflow consists of a BeforeAll-ModuleLocal job that provisions shared GitHub infrastructure (organization and user repositories named Test-{OS}-{TokenType}-{RunID}), a fan-out matrix of Test-ModuleLocal jobs (one per test file × OS × auth case), and an AfterAll-ModuleLocal job that tears everything down. The setup and teardown were introduced in #541 so that the matrix could share repositories instead of each leaf job creating its own.

Request

When a small number of leaf test jobs in the Test-ModuleLocal matrix fail because of intermittent issues (rate limits, transient API errors, GitHub-side flakiness), the only recovery paths available in the GitHub Actions UI are Re-run failed jobs and Re-run all jobs. Re-run failed jobs does not produce a working rerun for this workflow, and Re-run all jobs works only by repeating every job that already succeeded:

What happens

  1. The matrix completes with a mix of successful and failed leaf jobs (the screenshot in the originating discussion shows Test-Releases, Test-Repositories, and Test-Variables red while ~25 sibling jobs are green).
  2. AfterAll-ModuleLocal is configured with if: ... && always() and runs as soon as Test-ModuleLocal finishes, regardless of the matrix outcome. It deletes every Test-{OS}-{TokenType}-{RunID}* repository for the run.
  3. The user clicks Re-run failed jobs.
  4. GitHub Actions reruns only the failed leaf jobs in Test-ModuleLocal. It does not rerun BeforeAll-ModuleLocal because that job succeeded the first time, and Re-run failed jobs does not re-execute jobs listed in needs that already succeeded or were skipped.
  5. The rerun leaf jobs immediately fail in their per-context BeforeAll, where they call Get-GitHubRepository -Name "Test-{OS}-{TokenType}-{RunID}" against a repository that no longer exists.
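
The job graph behind the steps above can be sketched roughly as follows. This is a minimal illustration, not the actual Process-PSModule.yml: the step bodies, matrix values, and runner labels are placeholders, the auth-case dimension of the matrix is omitted, the elided part of the AfterAll-ModuleLocal condition is left out, and mapping {RunID} to github.run_id is an assumption.

```yaml
# Hypothetical sketch of the job graph; names of jobs come from the issue,
# everything else (steps, matrix values, runners) is a placeholder.
name: Process-PSModule (sketch)
on: pull_request

jobs:
  BeforeAll-ModuleLocal:
    runs-on: ubuntu-latest
    steps:
      # Assumption: {RunID} in the repository name is github.run_id.
      - name: Provision shared repositories
        run: echo "create Test-{OS}-{TokenType}-${{ github.run_id }}"

  Test-ModuleLocal:
    needs: BeforeAll-ModuleLocal
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        test: [Releases, Repositories, Variables]
    runs-on: ${{ matrix.os }}
    steps:
      # Each leaf job resolves the shared repository before running its tests.
      - name: Run one test file against the shared repository
        run: echo "look up Test-{OS}-{TokenType}-${{ github.run_id }}"

  AfterAll-ModuleLocal:
    needs: Test-ModuleLocal
    # The real condition has an additional term that the issue elides.
    if: always()
    runs-on: ubuntu-latest
    steps:
      # Runs even when leaf jobs failed, so the repositories are gone
      # before anyone can click Re-run failed jobs.
      - name: Delete shared repositories
        run: echo "delete Test-{OS}-{TokenType}-${{ github.run_id }}*"
```

If the name is indeed derived from github.run_id, the rerun cannot sidestep the problem by itself: github.run_id stays the same across rerun attempts (only github.run_attempt increments), so a rerun leaf job looks up exactly the repository name that teardown already deleted.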

What is expected

A user who wants to retry transiently failing leaf jobs can do so without having to rerun every successful sibling job. The shared infrastructure that the leaf jobs depend on must be present (or re-created) before the rerun executes, so the targeted retry succeeds when the underlying flake has cleared.

Reproduction steps

  1. Open a pull request against main so that Process-PSModule.yml runs end-to-end.
  2. Wait for the workflow to finish with at least one failed leaf job in the Test-ModuleLocal matrix and with the AfterAll-ModuleLocal job completed (green).
  3. From the workflow run page, click Re-run failed jobs.
  4. Observe that every rerun leaf job fails in its BeforeAll block while resolving the shared repository, because AfterAll-ModuleLocal deleted the repository at the end of the original run (step 2).

Regression

This behavior was introduced by #541. Before that change, every leaf test job created and deleted its own repositories using [guid]::NewGuid(), so reruns were self-contained: each leaf job rebuilt the infrastructure it needed. Consolidating to shared run-scoped infrastructure removed that property.

Workaround

The only working recovery today is Re-run all jobs, which re-executes every successful leaf job. For a workflow where a single matrix leg can run for well over an hour (Test-Apps (macOS) ran 1h 40m in the linked screenshot), this wastes substantial CI time and consumes additional GitHub API quota that the shared-repo design in #541 was specifically introduced to conserve.

Acceptance criteria

  • A user can trigger a retry of a specific failed leaf job (or a subset of failed leaf jobs) and have it succeed when the underlying transient cause has cleared, without rerunning every successful sibling.
  • The shared infrastructure the leaf jobs depend on is guaranteed to exist when those jobs execute, regardless of whether they are part of the original run or a partial rerun.
  • Successful leaf jobs from the original run are not re-executed unnecessarily on a partial retry.
  • Concurrent workflow runs continue to be isolated from each other (the run-scoped naming guarantee from #541 is preserved).
  • API call budget for a partial retry is bounded — a retry of one leaf job does not cost more than the leaf job itself plus whatever minimum setup is unavoidable.
