gpl: parallelize updateGradients and nesterovUpdateCoordinates#10451
gpl: parallelize updateGradients and nesterovUpdateCoordinates#10451maliberty wants to merge 2 commits into
Conversation
The OpenMP parallel-for in NesterovBase::updateGradients was disabled
because reduction(+: ...) over floats has an unspecified combine order
across threads, yielding bit-different sums between runs and between
thread counts. Tests rely on golden DEFs, so the loop ran serially.
Split the body in two:
1. Parallel per-cell phase computing wireLengthGrads[i], densityGrads[i],
and sumGrads[i] (independent writes to distinct indices). This is the
expensive work — WA gradient, density gradient, preconditioner.
2. Serial in-order phase accumulating wireLengthGradSum_,
densityGradSum_, and gradSum. Order matches the original loop, so
output is bit-identical to the prior serial version regardless of
thread count.
Verified by running convergence01.tcl at 1 and 8 threads and diffing the
resulting DEFs against each other and the committed golden — all three
identical. Full gpl ctest suite (62 tests) passes.
Signed-off-by: Matt Liberty <mliberty@precisioninno.com>
The axpy-and-clamp loop at the top of NesterovBase::nesterovUpdateCoordinates was serial. Each iteration writes to nextCoordi_[k] and nextSLPCoordi_[k] at distinct indices, with no cross-iteration dependencies and no float reduction, so it's trivially parallel. The clamping helpers (getDensityCoordiLayoutInsideX/Y) are const and read only bin-grid bounds. Output is bit-identical to the serial version regardless of thread count: convergence01.tcl produces the same DEF at 1 and 8 threads, matching the committed golden. Full gpl ctest suite (62 tests) passes. The loop runs once per backtracking retry inside doBackTracking, scaling linearly with cell count, so the win grows with design size. Signed-off-by: Matt Liberty <mliberty@precisioninno.com>
|
clang-tidy review says "All clean, LGTM! 👍" |
There was a problem hiding this comment.
Code Review
This pull request refactors the gradient update logic in NesterovBase::updateGradients into a two-phase process—parallel computation followed by serial reduction—to ensure deterministic results across different thread counts. Additionally, it parallelizes the coordinate update loop in nesterovUpdateCoordinates. I have no feedback to provide as the review comments were purely explanatory or did not identify actionable issues.
|
@codex review |
|
Codex Review: Didn't find any major issues. Already looking forward to the next diff. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Restores OpenMP parallelism to two hot per-iteration loops in
NesterovBase, with bit-identical output preserved.updateGradients(nesterovBase.cpp) — the OMP pragma had been disabled with a TODO explaining thatreduction(+: ...)over floats produced non-deterministic sums. The serial loop has been the dominant per-iteration cost ever since. Replaced with a two-phase pattern: parallel per-cell write of gradients + preconditioner into the existing output vectors, then a serial in-order accumulation that preserves the originalsum += fabs(x); sum += fabs(y);order. Result is bit-identical to the prior serial version regardless of thread count.nesterovUpdateCoordinates(nesterovBase.cpp) — the per-cell axpy + layout-clamp loop was missing#pragma omp parallel foreven though every iteration writes to distinct indices and uses only const helpers. Added the pragma. Bit-identical (independent writes, no reduction).Type of Change
Impact
2.0× wall-time speedup on
large01(274,700 cells) at 64 threads. Placement output is unchanged: every gpl ctest passes against its committed golden DEF, and benchmark runs at 1 thread, 8 threads, and 64 threads all produce identical HPWL.large01.tcl, 64 threads, 3 runs per condition:Final HPWL = 12,050,134,762 across all 9 runs — placement is bit-identical at every stage.
Verification
./etc/Build.sh).Full
gplctest suite: 62/62 passing. Each test diff-compares against a committed golden DEF, so the existing tests are themselves the bit-identical regression for this change. Additional ad-hoc determinism check:convergence01.tclproduces a DEF byte-for-byte identical to the committed golden at 1 thread and at 8 threads after each commit.Related Issues
None.