Maintainers and contributors run the Process-PSModule.yml workflow on every pull request. The workflow consists of a BeforeAll-ModuleLocal job that provisions shared GitHub infrastructure (organization and user repositories named Test-{OS}-{TokenType}-{RunID}), a fan-out matrix of Test-ModuleLocal jobs (one per test file × OS × auth case), and an AfterAll-ModuleLocal job that tears everything down. The setup and teardown were introduced in #541 so that the matrix could share repositories instead of each leaf job creating its own.
Request
When a small number of leaf test jobs in the Test-ModuleLocal matrix fail because of intermittent issues (rate limits, transient API errors, GitHub-side flakiness), the only recovery paths in the GitHub Actions UI are Re-run failed jobs and Re-run all jobs. Neither gives a working targeted rerun for this workflow: Re-run failed jobs fails as described below, and Re-run all jobs repeats every job that already succeeded.
What happens
- The matrix completes with a mix of successful and failed leaf jobs (the screenshot on the originating discussion shows Test-Releases, Test-Repositories, and Test-Variables red while ~25 sibling jobs are green).
- AfterAll-ModuleLocal is configured with if: ... && always() and runs as soon as Test-ModuleLocal finishes, regardless of the matrix outcome. It deletes every Test-{OS}-{TokenType}-{RunID}* repository for the run.
- The user clicks Re-run failed jobs.
- GitHub Actions reruns only the failed leaf jobs in Test-ModuleLocal. It does not rerun BeforeAll-ModuleLocal because that job succeeded the first time, and skipped/successful needs are not re-executed by Re-run failed jobs.
- The rerun leaf jobs immediately fail in their per-context BeforeAll, where they call Get-GitHubRepository -Name "Test-{OS}-{TokenType}-{RunID}" against a repository that no longer exists.
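The sequence above can be illustrated with a small toy model. This is only a sketch of the ordering problem, not the workflow's actual code: the Run class, the "Linux" and "PAT" values, and the run ID 1001 are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Toy stand-in for one workflow run (hypothetical, not the real workflow)."""
    run_id: int
    repos: set = field(default_factory=set)      # shared repositories that currently exist
    results: dict = field(default_factory=dict)  # leaf job name -> outcome

    def repo_name(self, os_name: str, token_type: str) -> str:
        # Naming pattern taken from the issue: Test-{OS}-{TokenType}-{RunID}
        return f"Test-{os_name}-{token_type}-{self.run_id}"

    def before_all(self, contexts):
        # Models BeforeAll-ModuleLocal: provision shared repositories once.
        for os_name, token_type in contexts:
            self.repos.add(self.repo_name(os_name, token_type))

    def run_leaf(self, job, os_name, token_type, flaky=False):
        # Models the leaf BeforeAll: resolve the shared repository first.
        if self.repo_name(os_name, token_type) not in self.repos:
            self.results[job] = "failure: repository not found"
        else:
            self.results[job] = "failure: transient" if flaky else "success"

    def after_all(self):
        # Models AfterAll-ModuleLocal: runs regardless of outcome, deletes everything.
        self.repos.clear()


run = Run(run_id=1001)
run.before_all([("Linux", "PAT")])
run.run_leaf("Test-Releases", "Linux", "PAT", flaky=True)  # transient failure
run.run_leaf("Test-Apps", "Linux", "PAT")                  # succeeds
run.after_all()

# "Re-run failed jobs": only failed leaf jobs run again; before_all is not repeated.
failed = [job for job, outcome in run.results.items() if outcome != "success"]
for job in failed:
    run.run_leaf(job, "Linux", "PAT", flaky=False)  # the flake has cleared

print(run.results)
# {'Test-Releases': 'failure: repository not found', 'Test-Apps': 'success'}
```

In this model the retried job fails even though the transient cause is gone, because the teardown already removed the repository it looks up.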
What is expected
A user who wants to retry transiently failing leaf jobs can do so without having to rerun every successful sibling job. The shared infrastructure that the leaf jobs depend on must be present (or re-created) before the rerun executes, so the targeted retry succeeds when the underlying flake has cleared.
Reproduction steps
1. Open a pull request against main so that Process-PSModule.yml runs end-to-end.
2. Wait for the workflow to finish with at least one failed leaf job in the Test-ModuleLocal matrix and the AfterAll-ModuleLocal job completed (green).
3. From the workflow run page, click Re-run failed jobs.
4. Observe that every rerun leaf job fails in its BeforeAll block while resolving the shared repository, because AfterAll-ModuleLocal deleted the repository at the end of step 2.
Regression
This behavior was introduced by #541. Before that change, every leaf test job created and deleted its own repositories using [guid]::NewGuid(), so reruns were self-contained — each leaf job rebuilt the infrastructure it needed. Consolidating to shared run-scoped infrastructure removed that property.
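The difference between the two naming schemes can be sketched as follows. This is a Python analogue for illustration only: it assumes {RunID} is the GitHub Actions run ID (which stays the same when a run is re-run), and the function names, the "Test-" prefix on the GUID-based name, and the sample values are hypothetical.

```python
import uuid


def shared_repo_name(os_name: str, token_type: str, run_id: int) -> str:
    # Run-scoped pattern from the issue: Test-{OS}-{TokenType}-{RunID}
    return f"Test-{os_name}-{token_type}-{run_id}"


def per_job_repo_name() -> str:
    # Analogue of the earlier [guid]::NewGuid() approach; the prefix is illustrative.
    return f"Test-{uuid.uuid4()}"


# A rerun keeps the run ID of the original attempt, so it resolves the same
# name that the teardown already deleted.
first_attempt = shared_repo_name("Linux", "PAT", 1001)
rerun_attempt = shared_repo_name("Linux", "PAT", 1001)
print(first_attempt == rerun_attempt)  # True

# A GUID-based name is fresh on every attempt, so each attempt creates and
# deletes its own repository and never depends on an earlier one.
print(per_job_repo_name() == per_job_repo_name())  # False
```

The shared scheme is what makes a partial rerun depend on state from the original attempt; the per-job scheme had no such dependency, at the cost of more repositories and API calls.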
Workaround
The only working recovery today is Re-run all jobs, which re-executes every successful leaf job. For a workflow where a single matrix leg can run for well over an hour (Test-Apps (macOS) ran 1h 40m in the linked screenshot), this wastes substantial CI time and consumes additional GitHub API quota that the shared-repo design in #541 was specifically introduced to conserve.
Acceptance criteria
- A user can trigger a retry of a specific failed leaf job (or a subset of failed leaf jobs) and have it succeed when the underlying transient cause has cleared, without rerunning every successful sibling.
- The shared infrastructure the leaf jobs depend on is guaranteed to exist when those jobs execute, regardless of whether they are part of the original run or a partial rerun.
- Successful leaf jobs from the original run are not re-executed unnecessarily on a partial retry.
- Concurrent workflow runs continue to be isolated from each other (the run-scoped naming guarantee from #541 is preserved).
- API call budget for a partial retry is bounded — a retry of one leaf job does not cost more than the leaf job itself plus whatever minimum setup is unavoidable.