Align reward randomization bounds with GIGAFLOW paper by eugenevinitsky · Pull Request #401 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-04-12T22:40:25Z

Summary

Widens the reward randomization bounds in drive.ini to match the uniform ranges specified in GIGAFLOW (Cusumano-Towner et al.). On the 3.0 branch, every α listed under "Changed bounds" was narrower than the spec.
Pins the two coefficients GIGAFLOW specifies as fixed (velocity at 2.5e-3, timestep at 2.5e-5) so min == max.
Leaves overspeed unchanged ([-1.0, -0.9]) — it's a custom "GIGAFLOW++" term not in the paper, and the current magnitude is intentional.
No C/Python changes. Only pufferlib/config/ocean/drive.ini.

Sign convention caveat

Our C reward formulas in drive.h don't prepend a leading - to penalty coefficients (unlike vcha/turbostream). So GIGAFLOW ranges of U(0, X) for penalty terms are stored here as U(-X, 0) — this preserves the spec's effective reward while matching our code's convention.
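The mapping described above can be sketched in a few lines of Python. This is only an illustration of the convention: `spec_to_stored` and its `is_penalty` flag are hypothetical names, not functions in drive.h or pufferlib.

```python
def spec_to_stored(spec_lo: float, spec_hi: float, is_penalty: bool) -> tuple[float, float]:
    """Map a GIGAFLOW range U(spec_lo, spec_hi) to the (min, max) pair stored in drive.ini.

    Hypothetical helper: assumes penalty terms are stored with the sign folded
    into the coefficient, as the caveat above describes.
    """
    if spec_lo > spec_hi:
        raise ValueError("spec range must satisfy spec_lo <= spec_hi")
    if is_penalty:
        # Negate and swap so the stored min stays <= the stored max.
        # Writing 0.0 - x instead of -x avoids a cosmetic -0.0 for U(0, X).
        return (0.0 - spec_hi, 0.0 - spec_lo)
    return (spec_lo, spec_hi)


if __name__ == "__main__":
    print("collision  ", spec_to_stored(0.0, 3.0, is_penalty=True))
    print("lane_center", spec_to_stored(2.5e-4, 7.5e-3, is_penalty=True))
    print("lane_align ", spec_to_stored(2.5e-4, 2.5e-2, is_penalty=False))
```

Note that the swap matters for ranges that do not start at zero: a spec range U(2.5e-4, 7.5e-3) becomes U(-7.5e-3, -2.5e-4), not U(-2.5e-4, -7.5e-3).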

Changed bounds

| Term | Before | After | Source in spec |
| --- | --- | --- | --- |
| `collision` | `[-3.0, -0.1]` | `[-3.0, 0.0]` | `αcollision ∼ U(0, 3)` |
| `offroad` | `[-3.0, -0.1]` | `[-3.0, 0.0]` | `αboundary ∼ U(0, 3)` |
| `lane_align` | `[0.002, 0.0025]` | `[2.5e-4, 2.5e-2]` | `αl-align ∼ U(2.5e-4, 2.5e-2)` |
| `lane_center` | `[-7.5e-4, -6.5e-4]` | `[-7.5e-3, -2.5e-4]` | `αl-center ∼ U(2.5e-4, 7.5e-3)` |
| `velocity` | `[0.0, 0.005]` | pinned `[2.5e-3, 2.5e-3]` | `αvelocity = 2.5e-3` (fixed) |
| `center_bias` | `[-0.1, 0.1]` | `[-0.5, 0.5]` | `αcenter-bias ∼ U(-0.5, 0.5)` |
| `timestep` | `[-5e-5, 0.0]` | pinned `[-2.5e-5, -2.5e-5]` | `αtimestep = 2.5e-5` (fixed) |
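As a rough illustration of what these bounds mean, the sketch below draws one coefficient per term uniformly from its "After" range, with pinned terms (min == max) returned as constants. The `BOUNDS` dict and `sample_coefs` are hypothetical names for this example; the actual sampling lives in C (`generate_reward_coefs()` in drive.h) and may differ in detail.

```python
import random

# "After" bounds copied from the table above. The dict layout is illustrative
# and is not the format used in drive.ini.
BOUNDS = {
    "collision": (-3.0, 0.0),
    "offroad": (-3.0, 0.0),
    "lane_align": (2.5e-4, 2.5e-2),
    "lane_center": (-7.5e-3, -2.5e-4),
    "velocity": (2.5e-3, 2.5e-3),      # pinned
    "center_bias": (-0.5, 0.5),
    "timestep": (-2.5e-5, -2.5e-5),    # pinned
}


def sample_coefs(bounds: dict, rng: random.Random) -> dict:
    """Draw one reward coefficient per term, uniformly within its bounds."""
    coefs = {}
    for name, (lo, hi) in bounds.items():
        # A pinned term has lo == hi, so it is returned unchanged.
        coefs[name] = lo if lo == hi else rng.uniform(lo, hi)
    return coefs


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs draw the same values
    for name, value in sample_coefs(BOUNDS, rng).items():
        print(f"{name:12s} {value: .6g}")
```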

Already-matching bounds (unchanged)

goal_radius (U(2, 12)), comfort (U(0, 0.1)), traffic_light/stop-line (U(0, 1)), vel_align (U(0, 1)), reverse (U(2.5e-4, 7.5e-3)).

Not addressed (out of scope)

overspeed bounds — left at [-1.0, -0.9] per request.
max_goal_speed = 10.0 vs GIGAFLOW's v_goal = 3 — this is a structural difference (waypoint vs. every-goal semantics), not a numeric tweak to a bound.
velocity/timestep are also hardcoded in generate_reward_coefs() at drive.h:647-648 regardless of bounds; this PR just makes the bounds consistent with the hardcoded values for observation normalization purposes.
Missing R_stop-line firing logic in C (the metric is never computed, so traffic_light bounds are currently decorative).

Test plan

Build the C extension locally: python setup.py build_ext --inplace --force
Sanity-check that the reward_coefs landing in the observation fall within the new ranges (e.g. the lane_align coef in observations should now span a much wider range than before)
Launch a short cluster run (e.g. 100M steps) and confirm training is stable with the widened reward ranges
Compare reward-component logs (lane_alignment_rate, lane_center_rate, velocity_progress_sum, collision/offroad rates) against a run on plain 3.0 to verify per-component magnitudes move in the expected direction
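The range sanity check in the plan above could look roughly like the following. It assumes the coefficients have already been pulled out of the observations into plain dicts, one per agent or episode; how they are laid out in the observation vector, and whether they are normalized there, is not covered here. `summarize_coefs` is a hypothetical helper, not part of the repo.

```python
def summarize_coefs(samples: list, bounds: dict, tol: float = 1e-9) -> dict:
    """Compare observed coefficient values against their configured bounds.

    samples: list of {term: value} dicts (assumed already extracted from observations).
    bounds:  {term: (min, max)} as configured.
    tol:     slack for float32 round-off when values come back from C.
    """
    report = {}
    for name, (lo, hi) in bounds.items():
        values = [s[name] for s in samples]
        seen_lo, seen_hi = min(values), max(values)
        width = hi - lo
        report[name] = {
            "min": seen_lo,
            "max": seen_hi,
            "in_range": (lo - tol) <= seen_lo and seen_hi <= (hi + tol),
            # Fraction of the configured range actually covered by the samples;
            # pinned terms (width == 0) count as fully covered.
            "coverage": (seen_hi - seen_lo) / width if width > 0 else 1.0,
        }
    return report


if __name__ == "__main__":
    bounds = {"lane_align": (2.5e-4, 2.5e-2), "velocity": (2.5e-3, 2.5e-3)}
    samples = [
        {"lane_align": 0.001, "velocity": 2.5e-3},
        {"lane_align": 0.020, "velocity": 2.5e-3},
    ]
    for name, row in summarize_coefs(samples, bounds).items():
        print(name, row)
```

A low `coverage` value over many samples would suggest the old, narrower bounds are still in effect, while `in_range` being false points at a sign or bound mismatch.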

The reward bounds in drive.ini were narrower than the uniform ranges specified in GIGAFLOW (Cusumano-Towner et al.) for every randomized α. This widens them to match the paper, keeps the fixed coefficients (velocity, timestep) pinned at their spec values, and leaves overspeed alone since it's a custom "GIGAFLOW++" term.

Sign convention: our C reward formulas don't prepend a leading `-` to penalty coefficients (unlike turbostream), so spec ranges U(0, X) are stored here as U(-X, 0).

Changes:
- collision: [-3.0, -0.1] -> [-3.0, 0.0]
- offroad: [-3.0, -0.1] -> [-3.0, 0.0]
- lane_align: [0.002, 0.0025] -> [2.5e-4, 2.5e-2]
- lane_center: [-7.5e-4, -6.5e-4] -> [-7.5e-3, -2.5e-4]
- velocity: [0.0, 0.005] -> pinned [2.5e-3, 2.5e-3]
- center_bias: [-0.1, 0.1] -> [-0.5, 0.5]
- timestep: [-5e-5, 0.0] -> pinned [-2.5e-5, -2.5e-5]

Unchanged:
- comfort, traffic_light, vel_align, reverse, goal_radius already matched the spec.
- overspeed left at [-1.0, -0.9] (not in GIGAFLOW).

Copilot

Pull request overview

Updates the Ocean Drive environment’s reward-randomization bounds in drive.ini to match the uniform ranges specified in the GIGAFLOW paper (Cusumano-Towner et al.), including pinning the paper’s fixed coefficients while keeping the custom overspeed term unchanged.

Changes:

Widened multiple reward randomization bounds (collision/offroad, lane align/center, center bias) to match GIGAFLOW’s specified ranges (with sign mapping to match current C reward sign conventions).
Pinned velocity and timestep coefficients by setting min == max to the paper’s fixed values.
Added clarifying comments documenting the GIGAFLOW mapping and noting known limitations (e.g., stop-line/traffic-light metric not currently firing).


Copilot AI review requested due to automatic review settings April 12, 2026 22:40

Copilot started reviewing on behalf of eugenevinitsky April 12, 2026 22:41

Copilot AI reviewed Apr 12, 2026

eugenevinitsky wants to merge 1 commit into 3.0 from ev/gigaflow-reward-bounds
