Align reward randomization bounds with GIGAFLOW paper #401
Open
eugenevinitsky wants to merge 1 commit into 3.0
The reward bounds in drive.ini were narrower than the uniform ranges specified in GIGAFLOW (Cusumano-Towner et al.) for every randomized α. This widens them to match the paper, keeps the fixed coefficients (velocity, timestep) pinned at their spec values, and leaves overspeed alone since it's a custom "GIGAFLOW++" term.

Sign convention: our C reward formulas don't prepend a leading `-` to penalty coefficients (unlike turbostream), so spec ranges U(0, X) are stored here as U(-X, 0).

Changes:
- collision: [-3.0, -0.1] -> [-3.0, 0.0]
- offroad: [-3.0, -0.1] -> [-3.0, 0.0]
- lane_align: [0.002, 0.0025] -> [2.5e-4, 2.5e-2]
- lane_center: [-7.5e-4, -6.5e-4] -> [-7.5e-3, -2.5e-4]
- velocity: [0.0, 0.005] -> pinned [2.5e-3, 2.5e-3]
- center_bias: [-0.1, 0.1] -> [-0.5, 0.5]
- timestep: [-5e-5, 0.0] -> pinned [-2.5e-5, -2.5e-5]

Unchanged:
- comfort, traffic_light, vel_align, reverse, goal_radius already matched the spec.
- overspeed left at [-1.0, -0.9] (not in GIGAFLOW).
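The sign convention described above can be sketched in a few lines of Python. This is only an illustration: `spec_to_stored` and its `is_penalty` flag are hypothetical names, not part of `drive.h` or `drive.ini`, and the ranges are the ones quoted in this PR.

```python
def spec_to_stored(lo, hi, is_penalty):
    """Map a spec range U(lo, hi) to the [min, max] pair stored in the config.

    Hypothetical helper: penalty terms are negated and their endpoints
    swapped, so the stored pair still satisfies min <= max.
    """
    if lo > hi:
        raise ValueError("spec range must satisfy lo <= hi")
    if is_penalty:
        return (-hi, -lo)
    return (lo, hi)


# Spec ranges as quoted in this PR description.
collision = spec_to_stored(0.0, 3.0, is_penalty=True)
lane_center = spec_to_stored(2.5e-4, 7.5e-3, is_penalty=True)
lane_align = spec_to_stored(2.5e-4, 2.5e-2, is_penalty=False)
timestep = spec_to_stored(2.5e-5, 2.5e-5, is_penalty=True)
```

Reward terms such as `lane_align` pass through unchanged; only penalty terms are flipped, which is why the fixed `timestep` coefficient ends up stored as a negative value.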
Pull request overview
Updates the Ocean Drive environment’s reward-randomization bounds in drive.ini to match the uniform ranges specified in the GIGAFLOW paper (Cusumano-Towner et al.), including pinning the paper’s fixed coefficients while keeping the custom overspeed term unchanged.
Changes:
- Widened multiple reward randomization bounds (collision/offroad, lane align/center, center bias) to match GIGAFLOW’s specified ranges (with sign mapping to match current C reward sign conventions).
- Pinned `velocity` and `timestep` coefficients by setting `min == max` to the paper’s fixed values.
- Added clarifying comments documenting the GIGAFLOW mapping and noting known limitations (e.g., stop-line/traffic-light metric not currently firing).
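As a rough illustration of why `min == max` pins a coefficient, here is a minimal Python sketch of uniform sampling from a `[min, max]` pair. It is not the project's sampler (that lives in C, in `generate_reward_coefs()`); `sample_coef` is a hypothetical name used only here.

```python
import random


def sample_coef(lo, hi, rng):
    """Draw one coefficient uniformly from [lo, hi].

    When lo == hi the interval has zero width, so every draw returns
    that single value, i.e. the coefficient is pinned.
    """
    if lo > hi:
        raise ValueError("bounds must satisfy lo <= hi")
    return rng.uniform(lo, hi)


rng = random.Random(0)  # seeded only to make the sketch repeatable
velocity = sample_coef(2.5e-3, 2.5e-3, rng)    # pinned bound
lane_align = sample_coef(2.5e-4, 2.5e-2, rng)  # widened range from this PR
```

With a pinned bound every draw is identical, so the coefficient behaves as a constant while still going through the same bounds machinery as the randomized terms.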
Summary
- Widens the reward bounds in `drive.ini` to match the uniform ranges specified in GIGAFLOW (Cusumano-Towner et al.). Every randomized α was narrower than the spec on `3.0`.
- Pins the fixed coefficients (`velocity` at `2.5e-3`, `timestep` at the spec's `2.5e-5`, stored as `-2.5e-5`) so `min == max`.
- Leaves `overspeed` unchanged (`[-1.0, -0.9]`): it's a custom "GIGAFLOW++" term not in the paper, and the current magnitude is intentional.
- Only touches `pufferlib/config/ocean/drive.ini`.

Sign convention caveat
Our C reward formulas in `drive.h` don't prepend a leading `-` to penalty coefficients (unlike `vcha/turbostream`). So GIGAFLOW ranges of `U(0, X)` for penalty terms are stored here as `U(-X, 0)`, which preserves the spec's effective reward while matching our code's convention.

Changed bounds
| Coefficient | Before | After | GIGAFLOW spec |
| --- | --- | --- | --- |
| `collision` | `[-3.0, -0.1]` | `[-3.0, 0.0]` | α_collision ∼ U(0, 3) |
| `offroad` | `[-3.0, -0.1]` | `[-3.0, 0.0]` | α_boundary ∼ U(0, 3) |
| `lane_align` | `[0.002, 0.0025]` | `[2.5e-4, 2.5e-2]` | α_l-align ∼ U(2.5e-4, 2.5e-2) |
| `lane_center` | `[-7.5e-4, -6.5e-4]` | `[-7.5e-3, -2.5e-4]` | α_l-center ∼ U(2.5e-4, 7.5e-3) |
| `velocity` | `[0.0, 0.005]` | `[2.5e-3, 2.5e-3]` | α_velocity = 2.5e-3 (fixed) |
| `center_bias` | `[-0.1, 0.1]` | `[-0.5, 0.5]` | α_center-bias ∼ U(-0.5, 0.5) |
| `timestep` | `[-5e-5, 0.0]` | `[-2.5e-5, -2.5e-5]` | α_timestep = 2.5e-5 (fixed) |

Already-matching bounds (unchanged)
`goal_radius` (`U(2, 12)`), `comfort` (`U(0, 0.1)`), `traffic_light` / stop-line (`U(0, 1)`), `vel_align` (`U(0, 1)`), `reverse` (`U(2.5e-4, 7.5e-3)`).

Not addressed (out of scope)
- `overspeed` bounds: left at `[-1.0, -0.9]` per request.
- `max_goal_speed = 10.0` vs GIGAFLOW's `v_goal = 3`: this is a structural difference (waypoint vs. every-goal semantics), not a numeric tweak to a bound.
- `velocity` / `timestep` are also hardcoded in `generate_reward_coefs()` at `drive.h:647-648` regardless of bounds; this PR just makes the bounds consistent with the hardcoded values for observation normalization purposes.
- `R_stop-line` firing logic in C (the metric is never computed, so `traffic_light` bounds are currently decorative).

Test plan
- Build with `python setup.py build_ext --inplace --force`
- Check that the `reward_coefs` landing in the observation are in the new ranges (e.g. the `lane_align` coef in observations should now span a much wider range than before)
- Compare per-component metrics (`lane_alignment_rate`, `lane_center_rate`, `velocity_progress_sum`, collision/offroad rates) against a run on plain `3.0` to verify per-component magnitudes move in the expected direction
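The range check in the test plan could be scripted along these lines. This is a hedged sketch, not an existing test: `NEW_BOUNDS` copies the "After" values from this PR, `out_of_range` is a hypothetical helper, and it assumes the coefficients have already been extracted from the observation as raw (un-normalized) values keyed by name.

```python
# "After" bounds as listed in this PR description.
NEW_BOUNDS = {
    "collision": (-3.0, 0.0),
    "offroad": (-3.0, 0.0),
    "lane_align": (2.5e-4, 2.5e-2),
    "lane_center": (-7.5e-3, -2.5e-4),
    "velocity": (2.5e-3, 2.5e-3),
    "center_bias": (-0.5, 0.5),
    "timestep": (-2.5e-5, -2.5e-5),
}


def out_of_range(coefs, bounds=NEW_BOUNDS, tol=1e-9):
    """Return the names of coefficients that fall outside their bounds.

    `coefs` maps coefficient name -> raw sampled value (assumed to be
    un-normalized); `tol` absorbs small floating-point error.
    """
    bad = []
    for name, value in coefs.items():
        lo, hi = bounds[name]
        if not (lo - tol <= value <= hi + tol):
            bad.append(name)
    return bad


# Hypothetical sampled values, for illustration only.
ok_sample = {"lane_align": 0.01, "velocity": 2.5e-3, "timestep": -2.5e-5}
bad_sample = {"lane_align": 0.03, "collision": -1.0}
```

Running `out_of_range` over coefficients collected from many episodes and asserting that the result is empty would cover the second test-plan item. If the observation stores normalized coefficients, they would need to be mapped back to raw values first.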