Align reward randomization bounds with GIGAFLOW paper #401

Open
eugenevinitsky wants to merge 1 commit into 3.0 from ev/gigaflow-reward-bounds
@eugenevinitsky

Summary

  • Widens the reward randomization bounds in drive.ini to match the uniform ranges specified in GIGAFLOW (Cusumano-Towner et al.). On 3.0, every randomized α changed here was narrower than the spec.
  • Pins the two coefficients GIGAFLOW specifies as fixed (velocity at 2.5e-3; timestep at 2.5e-5, stored as -2.5e-5 under the sign convention below) by setting min == max.
  • Leaves overspeed unchanged ([-1.0, -0.9]) — it's a custom "GIGAFLOW++" term not in the paper, and the current magnitude is intentional.
  • No C/Python changes. Only pufferlib/config/ocean/drive.ini.

Sign convention caveat

Our C reward formulas in drive.h don't prepend a leading - to penalty coefficients (unlike vcha/turbostream). So GIGAFLOW ranges of U(0, X) for penalty terms are stored here as U(-X, 0) — this preserves the spec's effective reward while matching our code's convention.
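The mapping can be written out as follows. This is a sketch under one assumption: that the spec expresses a penalty term as a nonnegative coefficient with an explicit leading minus, while our code multiplies the metric by the stored coefficient directly. The symbols m (a nonnegative per-step metric) and c (the stored coefficient) are notation introduced here, not names from drive.h.

```latex
r_{\text{spec}} = -\alpha\, m, \quad \alpha \sim U(a, b)
\qquad\Longrightarrow\qquad
r_{\text{ours}} = c\, m, \quad c = -\alpha \sim U(-b, -a)
```

For instance, αl-center ∼ U(2.5e-4, 7.5e-3) becomes a stored range of [-7.5e-3, -2.5e-4], which is the lane_center entry in the table below.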

Changed bounds

| Term | Before | After | Source in spec |
| --- | --- | --- | --- |
| collision | [-3.0, -0.1] | [-3.0, 0.0] | αcollision ∼ U(0, 3) |
| offroad | [-3.0, -0.1] | [-3.0, 0.0] | αboundary ∼ U(0, 3) |
| lane_align | [0.002, 0.0025] | [2.5e-4, 2.5e-2] | αl-align ∼ U(2.5e-4, 2.5e-2) |
| lane_center | [-7.5e-4, -6.5e-4] | [-7.5e-3, -2.5e-4] | αl-center ∼ U(2.5e-4, 7.5e-3) |
| velocity | [0.0, 0.005] | pinned [2.5e-3, 2.5e-3] | αvelocity = 2.5e-3 (fixed) |
| center_bias | [-0.1, 0.1] | [-0.5, 0.5] | αcenter-bias ∼ U(-0.5, 0.5) |
| timestep | [-5e-5, 0.0] | pinned [-2.5e-5, -2.5e-5] | αtimestep = 2.5e-5 (fixed) |
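As a cross-check of the table, the sketch below applies the sign convention to each spec range and compares the result with the "After" column. The dictionary layout and the helper name `to_stored` are illustrative and not taken from drive.ini or drive.h; which terms count as penalties is inferred from the sign flips in the table itself.

```python
# Spec ranges from the "Source in spec" column, plus a flag marking terms
# whose sign is flipped when stored (inferred from the table, not from drive.h).
SPEC = {
    "collision":   ((0.0, 3.0), True),
    "offroad":     ((0.0, 3.0), True),
    "lane_align":  ((2.5e-4, 2.5e-2), False),
    "lane_center": ((2.5e-4, 7.5e-3), True),
    "velocity":    ((2.5e-3, 2.5e-3), False),   # fixed in the spec
    "center_bias": ((-0.5, 0.5), False),
    "timestep":    ((2.5e-5, 2.5e-5), True),    # fixed in the spec
}

# Bounds from the "After" column.
AFTER = {
    "collision":   (-3.0, 0.0),
    "offroad":     (-3.0, 0.0),
    "lane_align":  (2.5e-4, 2.5e-2),
    "lane_center": (-7.5e-3, -2.5e-4),
    "velocity":    (2.5e-3, 2.5e-3),
    "center_bias": (-0.5, 0.5),
    "timestep":    (-2.5e-5, -2.5e-5),
}


def to_stored(lo, hi, is_penalty):
    """Map a spec range U(lo, hi) to the stored range; penalties become U(-hi, -lo)."""
    return (-hi, -lo) if is_penalty else (lo, hi)


# Collect every term whose stored bound disagrees with the sign-mapped spec.
mismatches = {
    name: (to_stored(lo, hi, is_penalty), AFTER[name])
    for name, ((lo, hi), is_penalty) in SPEC.items()
    if to_stored(lo, hi, is_penalty) != AFTER[name]
}
print(mismatches or "all bounds match the sign-mapped spec")
```

Negating and swapping the endpoints keeps min <= max, which is why lane_center reads [-7.5e-3, -2.5e-4] rather than [-2.5e-4, -7.5e-3].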

Already-matching bounds (unchanged)

goal_radius (U(2, 12)), comfort (U(0, 0.1)), traffic_light/stop-line (U(0, 1)), vel_align (U(0, 1)), reverse (U(2.5e-4, 7.5e-3)).

Not addressed (out of scope)

  • overspeed bounds — left at [-1.0, -0.9] per request.
  • max_goal_speed = 10.0 vs GIGAFLOW's v_goal = 3 — this is a structural difference (waypoint vs. every-goal semantics), not a numeric tweak to a bound.
  • velocity/timestep are also hardcoded in generate_reward_coefs() at drive.h:647-648 regardless of bounds; this PR just makes the bounds consistent with the hardcoded values for observation normalization purposes.
  • Missing R_stop-line firing logic in C (the metric is never computed, so traffic_light bounds are currently decorative).

Test plan

  • Build the C extension locally: python setup.py build_ext --inplace --force
  • Sanity-check that the reward_coefs landing in the observation fall within the new ranges (e.g. the lane_align coefficient in observations should now span a much wider range than before)
  • Launch a short cluster run (e.g. 100M steps) and confirm training is stable with the widened reward ranges
  • Compare reward-component logs (lane_alignment_rate, lane_center_rate, velocity_progress_sum, collision/offroad rates) against a run on plain 3.0 to verify that per-component magnitudes move in the expected direction
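The range sanity check in the plan above could look roughly like the sketch below. It assumes the raw (un-normalized) coefficients have already been pulled out of the observation into a dict per agent; how they are laid out in the observation is not assumed here. `sample_coefs` is a hypothetical stand-in for the environment so the script runs on its own, and `out_of_range` is likewise an illustrative helper rather than an existing function.

```python
import random

# Bounds copied from the "After" column of this PR.
BOUNDS = {
    "collision":   (-3.0, 0.0),
    "offroad":     (-3.0, 0.0),
    "lane_align":  (2.5e-4, 2.5e-2),
    "lane_center": (-7.5e-3, -2.5e-4),
    "velocity":    (2.5e-3, 2.5e-3),      # pinned: min == max
    "center_bias": (-0.5, 0.5),
    "timestep":    (-2.5e-5, -2.5e-5),    # pinned: min == max
}

OLD_LANE_ALIGN_WIDTH = 0.0025 - 0.002     # width of the bound on 3.0


def out_of_range(coefs, bounds, tol=1e-9):
    """Return {name: value} for every coefficient outside its [lo, hi] bound."""
    return {
        name: value
        for name, value in coefs.items()
        if not (bounds[name][0] - tol <= value <= bounds[name][1] + tol)
    }


def sample_coefs(bounds, rng):
    """Hypothetical stand-in for the env: draw each coefficient uniformly."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}


rng = random.Random(0)
samples = [sample_coefs(BOUNDS, rng) for _ in range(1000)]

# 1) Every sampled coefficient must sit inside its bound.
violations = [bad for bad in (out_of_range(s, BOUNDS) for s in samples) if bad]

# 2) lane_align should now cover a much wider span than the old bound did.
lane_align = [s["lane_align"] for s in samples]
lane_align_spread = max(lane_align) - min(lane_align)

print("violations:", len(violations))
print("lane_align spread wider than old bound:", lane_align_spread > OLD_LANE_ALIGN_WIDTH)
```

To run this against a real environment, replace `samples` with the coefficients actually read back from observations. If the observation stores normalized values, they would need to be mapped back to raw coefficients first.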

The reward bounds in drive.ini were narrower than the uniform ranges
specified in GIGAFLOW (Cusumano-Towner et al.) for every randomized α.
This widens them to match the paper, keeps the fixed coefficients
(velocity, timestep) pinned at their spec values, and leaves overspeed
alone since it's a custom "GIGAFLOW++" term.

Sign convention: our C reward formulas don't prepend a leading `-` to
penalty coefficients (unlike turbostream), so spec ranges U(0, X) are
stored here as U(-X, 0).

Changes:
- collision:    [-3.0, -0.1]       -> [-3.0, 0.0]
- offroad:      [-3.0, -0.1]       -> [-3.0, 0.0]
- lane_align:   [0.002, 0.0025]    -> [2.5e-4, 2.5e-2]
- lane_center:  [-7.5e-4, -6.5e-4] -> [-7.5e-3, -2.5e-4]
- velocity:     [0.0, 0.005]       -> pinned [2.5e-3, 2.5e-3]
- center_bias:  [-0.1, 0.1]        -> [-0.5, 0.5]
- timestep:     [-5e-5, 0.0]       -> pinned [-2.5e-5, -2.5e-5]

Unchanged:
- comfort, traffic_light, vel_align, reverse, goal_radius already
  matched the spec.
- overspeed left at [-1.0, -0.9] (not in GIGAFLOW).
Copilot AI review requested due to automatic review settings April 12, 2026 22:40

Copilot AI left a comment

Pull request overview

Updates the Ocean Drive environment’s reward-randomization bounds in drive.ini to match the uniform ranges specified in the GIGAFLOW paper (Cusumano-Towner et al.), including pinning the paper’s fixed coefficients while keeping the custom overspeed term unchanged.

Changes:

  • Widened multiple reward randomization bounds (collision/offroad, lane align/center, center bias) to match GIGAFLOW’s specified ranges (with sign mapping to match current C reward sign conventions).
  • Pinned velocity and timestep coefficients by setting min == max to the paper’s fixed values.
  • Added clarifying comments documenting the GIGAFLOW mapping and noting known limitations (e.g., stop-line/traffic-light metric not currently firing).


2 participants