feat: t_min reward, discrete action space, seed sensitivity analysis #14
Merged
Environment (AntiPendulumEnv):
- obs[3] changed from absolute load x-velocity to pure angular velocity
theta_dot = (cm_v[0] - origin_v[0]) / wire.length, independent of crane
translation. Improves PPO value estimation quality.
WARNING for Q-agent: existing Q-tables are invalid after this change
(observation space semantics changed) -- retrain from scratch with
continuous_actions=False explicitly passed to AntiPendulumEnv.
- Five new reward terms (all default 0.0, fully backward-compatible):
angle, angular_velocity, crane_velocity, crane_acceleration,
angular_acceleration.
- size parameter renamed to rail_limit (half-span of crane rail in metres).
- continuous_actions parameter (default False): when True, action space is
Box([-1, 1], shape=(1,)) -- the standard normalized action convention in
the Gymnasium/SB3 ecosystem. Policy networks end with tanh, so [-1, 1] is
the natural output range with no clipping needed. The environment does the
physical scaling: crane_acc = action[0] * acc (two-layer pattern: normalized
signal -> physical quantity). All Q-agent code passes continuous_actions=False
explicitly to preserve Discrete(3) backward compatibility.
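The two formulas in the environment notes above can be sketched as follows. This is a minimal illustration, not the environment's actual code: the function names `angular_velocity` and `crane_acceleration` are hypothetical, and velocities are assumed to be plain sequences whose first element is the x-component.

```python
from collections.abc import Sequence


def angular_velocity(
    cm_v: Sequence[float], origin_v: Sequence[float], wire_length: float
) -> float:
    """theta_dot: load x-velocity relative to the wire origin, per unit wire length."""
    return (cm_v[0] - origin_v[0]) / wire_length


def crane_acceleration(action: Sequence[float], acc: float) -> float:
    """Map a normalized action in [-1, 1] to a physical crane acceleration."""
    return action[0] * acc


# Crane and load translating together: the absolute load velocity is nonzero,
# but the relative (angular) velocity is zero because nothing is swinging.
print(angular_velocity([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], wire_length=10.0))

# Half-strength command in the negative direction, scaled by the physical limit.
print(crane_acceleration([-0.5], acc=2.0))
```

Subtracting the origin velocity is what makes the observation independent of crane translation, which is the property the commit message credits for better PPO value estimates.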
Experiment config system (new: src/crane_controller/experiment_config.py):
- RewardConfig, TrainingConfig, ExperimentConfig frozen dataclasses replace
the opaque reward_fac tuple, eliminating the silent index-swap bug class.
- YAML experiment files in experiments/:
baseline.yaml -- original pre-refactor defaults for reference
hybrid_cv01.yaml -- validated config: energy=1.0, crane_velocity=0.1,
position=0.02, terminal_penalty=-5.0,
randomize_start=True
Seeds 42, 987, 5775 all achieve 6/6 OOD generalisation at start_speed=7.0
(7x the [0.1, 1.0] training range).
- JSON sidecar (*_meta.json) saved alongside every model; play_ppo.py reads
it automatically so reward config and flags follow the model without manual
CLI flags at playback time.
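A rough sketch of how a frozen config dataclass and a JSON sidecar might fit together. The field names come from hybrid_cv01.yaml above, but the defaults, the JSON layout, and the helpers `meta_path`, `save_meta` and `load_meta` are assumptions made for illustration, not the contents of experiment_config.py.

```python
import json
from collections.abc import Mapping
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class RewardConfig:
    # Named fields instead of positional tuple slots, so two weights
    # cannot be swapped silently. Defaults of 0.0 are assumed here.
    energy: float = 0.0
    crane_velocity: float = 0.0
    position: float = 0.0
    terminal_penalty: float = 0.0

    @classmethod
    def from_dict(cls, data: Mapping[str, float]) -> "RewardConfig":
        known = {f.name for f in fields(cls)}
        unknown = set(data) - known
        if unknown:
            # Fail loudly on misspelled keys rather than ignoring them.
            raise ValueError(f"unknown reward terms: {sorted(unknown)}")
        return cls(**{name: float(value) for name, value in data.items()})


def meta_path(model_path: Path) -> Path:
    """models/foo.zip -> models/foo_meta.json (hypothetical helper)."""
    return model_path.with_name(model_path.stem + "_meta.json")


def save_meta(model_path: Path, reward: RewardConfig, continuous_actions: bool) -> Path:
    payload = {"reward": asdict(reward), "continuous_actions": continuous_actions}
    path = meta_path(model_path)
    _ = path.write_text(json.dumps(payload, indent=2))
    return path


def load_meta(model_path: Path) -> tuple[RewardConfig, bool]:
    payload = json.loads(meta_path(model_path).read_text())
    return RewardConfig.from_dict(payload["reward"]), bool(payload["continuous_actions"])
```

Because the dataclass is frozen, a loaded config cannot be mutated by accident between training and playback.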
Scripts (train_ppo.py / play_ppo.py):
- New flags: --config PATH, --continuous-actions/--no-continuous-actions,
--randomize-start/--no-randomize-start, --start-speed, --seed,
--rail-limit, --resume-from.
- play_ppo.py pre-parses --model-path to auto-load sidecar defaults before
argparse, so --continuous-actions/--randomize-start default to the values
used during training.
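The pre-parse step described above could look roughly like this. It is a sketch under stated assumptions: the sidecar key names (`continuous_actions`, `randomize_start`) and the helpers `load_sidecar_defaults` and `build_parser` are hypothetical, and only the argparse calls themselves are standard library API.

```python
import argparse
import json
from pathlib import Path


def load_sidecar_defaults(argv: list[str]) -> dict[str, object]:
    """Peek at --model-path only, ignoring every other argument."""
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--model-path", type=Path, default=None)
    known, _unknown = pre.parse_known_args(argv)
    if known.model_path is None:
        return {}
    sidecar = known.model_path.with_name(known.model_path.stem + "_meta.json")
    if not sidecar.exists():
        return {}
    return json.loads(sidecar.read_text())


def build_parser(defaults: dict[str, object]) -> argparse.ArgumentParser:
    """Full parser whose boolean defaults come from the sidecar (if any)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=Path, required=True)
    parser.add_argument(
        "--continuous-actions",
        action=argparse.BooleanOptionalAction,  # also creates --no-continuous-actions
        default=bool(defaults.get("continuous_actions", False)),
    )
    parser.add_argument(
        "--randomize-start",
        action=argparse.BooleanOptionalAction,
        default=bool(defaults.get("randomize_start", False)),
    )
    return parser
```

A caller would run `build_parser(load_sidecar_defaults(sys.argv[1:])).parse_args()`; an explicit flag on the command line still overrides the sidecar value.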
Tests: 38 passing (16 new experiment_config tests; parametrized
discrete/continuous action-space and step tests).
README: updated obs description, action space docs, experiment config
section, pre-trained model table and play commands.
…ntinuous)
Trained with experiments/hybrid_cv01.yaml:
energy=1.0, crane_velocity=0.1, position=0.02,
terminal_penalty=-5.0, randomize_start=True,
3M steps, 32 parallel envs.
All six models achieve 6/6 OOD generalisation at start_speed=7.0
(7x the [0.1, 1.0] training range).
Each model bundle (3 files):
.zip -- policy network weights (SB3 PPO)
_vecnorm.pkl -- observation normalisation statistics (required for
correct inference; must match the .zip it was saved with)
_meta.json -- reward config + flags (read automatically by play_ppo.py)
Play any model:
uv run python scripts/play_ppo.py \
--model-path models/ppo_hcv01_discrete_s42.zip \
--episodes 3 --render-mode plot
uv run python scripts/play_ppo.py \
--model-path models/ppo_hcv01_continuous_s42.zip \
--episodes 3 --render-mode plot
OOD test (7x training range):
uv run python scripts/play_ppo.py \
--model-path models/ppo_hcv01_discrete_s42.zip \
--episodes 6 --render-mode plot --randomize-start --start-speed 7.0
Seeds: 42, 987, 5775 -- discrete and continuous variants of each (~145 KB per .zip).
Seed drawn from random.org (blind, before seeing results) to avoid
cherry-picking. Both discrete and continuous variants achieve 6/6 OOD
generalisation at start_speed=7.0 (7x the [0.1, 1.0] training range),
consistent with seeds 42, 987, and 5775.
Models (6 files, trained with experiments/hybrid_cv01.yaml, 3M steps):
models/ppo_hcv01_{discrete,continuous}_s827596.{zip,_vecnorm.pkl,_meta.json}
README: added 4 representative OOD episode figures (2 discrete, 2 continuous)
under the pre-trained models section to illustrate the crane damping behaviour
at 7x training range. Figures committed to assets/.
- experiment_config.py: from_dict parameter type dict[str,object] -> Mapping[str,object] (dict is invariant in its value type; Mapping is covariant in it, so dict[str,float] callers are accepted). Mapping moved into TYPE_CHECKING block per ruff TC003. Unused write_text() return assigned to _.
- test_experiment_config.py: assign unused call results to _; suppress reportPrivateUsage on _meta_path import (internal utility, tested by design).
- controlled_crane_pendulum.py: suppress reportUnknownMemberType on self.reward usages (type flows from py_crane, which has incomplete stubs -- pre-existing).
- test_environment.py: suppress reportUnknownMemberType on wire.cm_v and Discrete.n accesses (py_crane / gymnasium stubs).
Remaining pyright issues (17 errors, 38 warnings) are all in controlled_mobile_crane.py (pygame stubs, pre-existing) and informational unnecessary-type-ignore comments in ppo_agent.py / test_ppo.py (pre-existing).
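The typing change to from_dict can be illustrated with a small sketch. The function name `count_terms` is hypothetical; the point is only how a type checker treats the two parameter annotations.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for annotations only, so it costs nothing at runtime.
    from collections.abc import Mapping


def count_terms(data: "Mapping[str, object]") -> int:
    """Read-only access: any dict[str, X] is acceptable here."""
    return len(data)


def count_terms_strict(data: "dict[str, object]") -> int:
    """A dict parameter could be mutated, so checkers require an exact match."""
    return len(data)


weights: dict[str, float] = {"energy": 1.0, "crane_velocity": 0.1}

count_terms(weights)         # accepted by pyright and mypy
count_terms_strict(weights)  # runs, but type checkers reject dict[str, float] here
```

The annotations are quoted because the Mapping name does not exist at runtime when the import sits inside the TYPE_CHECKING block.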
…-panel plots
- AntiPendulumEnv: new t_min_crane reward term (-t_min weight), _get_info() returns
t_min/x_pos/x_vel, show_plot() extended to 7 panels (angle, speed, crane pos/vel,
rewards, acceleration, t_min), render() accepts save_path
- RewardConfig: add t_min_crane, crane_velocity, angle, angular_velocity,
crane_acceleration, angular_acceleration fields
- TrainingConfig: add success_threshold field
- EpRewardLogCallback: new module (callbacks.py); 3-bucket classification
rail_hit/timelimit/success per interval; 13-column CSV log
- ppo_agent: thread success_threshold through __init__/load/resume/do_training;
do_one_episode() tracks t_min_trace and saves 7-panel PNG via render(save_path=)
- play_ppo.py: --save-png/--no-save-png flag; PNG named {stem}_play_ss{speed}_ep{n}.png
- train_ppo.py: --success-threshold CLI arg
- experiments: add sig_t_min.yaml and hybrid_t_min.yaml; update hybrid_cv01.yaml
- models: add hybrid_cv01_s42 and sig_t_min_s42 training logs (CSV)
- tests: add t_min_crane reward term test; add success_threshold assertions
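One plausible reading of the three-bucket classification in EpRewardLogCallback is sketched below. The precedence (rail hit first, then reward threshold) and the names `classify_episode` and `bucket_counts` are assumptions; the commit message gives only the bucket names.

```python
from collections import Counter


def classify_episode(terminated: bool, ep_reward: float, success_threshold: float) -> str:
    """Assign one finished episode to exactly one of three buckets."""
    if terminated:
        # Assumed: termination means the crane hit the end of the rail.
        return "rail_hit"
    if ep_reward >= success_threshold:
        return "success"
    # Ran to the time limit without reaching the reward threshold.
    return "timelimit"


def bucket_counts(
    episodes: list[tuple[bool, float]], success_threshold: float
) -> Counter[str]:
    """Count buckets over one logging interval."""
    return Counter(
        classify_episode(terminated, reward, success_threshold)
        for terminated, reward in episodes
    )


interval = [(True, -5.0), (False, 120.0), (False, 40.0), (False, 95.0)]
print(bucket_counts(interval, success_threshold=90.0))
```

Each bucket count would then become one column of the per-interval CSV row.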
- ppo_agent: EpisodeResult dataclass (start_speed, ep_steps, ep_reward,
terminated, truncated, success, t_min stats, x_pos/vel_final);
do_one_episode() returns EpisodeResult; success_threshold param defaults
to agent's training threshold
- play_ppo.py: --speed-sweep (runs [0.5,1.0,2.0,3.0,5.0,7.0,10.0]);
--save-csv/--no-save-csv (default True, writes {stem}_play_results.csv);
per-episode one-line summary logged to CLI
- do_one_episode(): read t_min_start from env.reset() info dict instead of t_min_trace[0] (which was post-step-1, not the true initial state)
- play_ppo.py: print per-speed summary table after --speed-sweep (speed, n, success%, rew_mean, rew_std, t_min_final_mean)
Records the first step at which t_min reaches its episode minimum, indicating when the active damping phase ends and the crane settles.
… at plateau
Previous definition (argmin) captured transient dips, which gave misleading results (e.g. speed=2.0 showed step 9 due to an early overshoot). New definition: last step where t_min > t_min_final + 0.05, so settle_step reflects when active damping truly ends rather than the earliest low t_min.
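The difference between the two settle_step definitions can be sketched on a made-up trace. The function names, the use of the trace index as the step number, and the return value of 0 when the trace never exceeds the tolerance band are assumptions for illustration.

```python
def settle_step_argmin(t_min_trace: list[float]) -> int:
    """Old definition: first index at which the trace reaches its minimum."""
    return min(range(len(t_min_trace)), key=t_min_trace.__getitem__)


def settle_step_plateau(t_min_trace: list[float], tol: float = 0.05) -> int:
    """New definition: last index where t_min is still above t_min_final + tol."""
    threshold = t_min_trace[-1] + tol
    last_above = 0  # assumed convention when the trace never leaves the band
    for step, value in enumerate(t_min_trace):
        if value > threshold:
            last_above = step
    return last_above


# Illustrative trace: an early overshoot dips to 0.2 at index 1,
# then the signal rises again before settling near 0.3.
trace = [3.0, 0.2, 2.5, 1.0, 0.4, 0.31, 0.3]

print(settle_step_argmin(trace))   # picks the transient dip
print(settle_step_plateau(trace))  # picks the end of active damping
```

On this trace the argmin version reports the early dip, while the plateau version reports the last sample that is still outside the tolerance band around the final value.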
…weep table
- no_crash = not terminated (rail stay); success = reward threshold (scale-dependent)
- theta_final, theta_dot_final read from final obs[2], obs[3]
- sweep table now shows nocrash%, succ%, settle_step, t_min_final, theta_final, thdot_final
Completes the multi-dimensional end-state quality vector: x_pos, x_vel, theta, theta_dot, t_min (existing) + pendulum KE and crane acceleration at last step (new). Sweep table drops succ% (scale-dependent) and adds x_pos_f, energy_f, acc_f as envelope metrics.
…rop t_min_f)
x_pos shown in metres (4 d.p.), energy normalised to fraction of initial KE (0=perfect, comparable across speeds), t_min_f dropped from terminal output (stays in CSV). succ% dropped (scale-dependent, replaced by envelope metrics).
…l_f, rename acc_f→x_acc_f
rew_mean divided by ep_steps; energy_frac first quality column; x_vel_f added; acc_f renamed x_acc_f for consistency with x_pos_m/x_vel_f naming.
These continuous and discrete model checkpoints (seeds 42, 987, 5775, 827596) were trained under the old naming convention and replaced by the hybrid_cv01 and sig_t_min runs on this branch.
…tivity section
- rename reward_comparison_s5775.md → reward_comparison.md
- update index.rst reference
- add §5 seed-sensitivity section: hybrid_cv01_s42 (crash-free, 1.46 cm attractor) and sig_t_min_s42 (crash band at +6.6–+7.8 m/s, 93% crash-free)
- expand §6.1 summary table to 4 columns covering both seeds
- commit s42 sweep CSVs and three new detail/comparison plots
…splay
- plot_sweep.py _plot_comparison: 3-panel → 2×3 grid, adding θ, θ_dot and acc columns so physics state is fully visible alongside reward signal.
- plot_training.py: solid/dashed linestyles for multi-CSV overlay.
- AntiPendulumEnv.show_plot: drop t_min panel, since reward internals should not appear in physics output (7 → 6 panels).
…5775)
Training logs and play results for the discrete action space variant, needed for the cont-vs-disc comparison in reward_comparison.md.
Previous s42 log was missing physical metric columns (x_pos, theta, settle_step), preventing apples-to-apples comparison with the discrete variant. Retraining with the updated callback adds these columns. sig_t_min models are superseded by hybrid_cv01 with RewardConfig.
…space analysis
Reorganise from a monolithic discrete section to cont-vs-disc side-by-side per seed. Add §2 simulation environment (physical params, PPO hyperparameters from hybrid_cv01.yaml). Key finding: Discrete(3) reaches machine-epsilon position (1e-14 to 1e-16 m) vs ~1 cm for Box(-1,1); settle range nearly identical across seeds, suggesting a physics-determined bang-bang sequence. Update and add figures for s42 and s5775; remove sig_t_min figures.
Five time-series plots (position, velocity, θ, θ_dot, acc, reward) for seed 42 — disc 1.0 m/s baseline, cont vs disc at 5.0 m/s, and the non-converging cont failure at 9.0 m/s resolved by disc at the same speed.
…cripts
- callbacks.py: guard ep_info_buffer None check before len() (mypy arg-type).
- ppo_agent.py: extend type: ignore to cover call-arg on render(save_path=).
- play_ppo.py: extract _log_sweep_table to reduce main() statement count (PLR0915); zip strict=True; ax.grid(visible=True).
- plot_sweep.py: zip strict=True; ax.grid(visible=True); removesuffix(); _ZERO_SPEED_THRESHOLD constant; docstring on main().
- plot_training.py: zip strict=True; ax.grid(visible=True); removesuffix().
eisDNV (Collaborator) approved these changes on Jun 16, 2026 and left a comment:
Good work. We start from here and align other branches with the result in the next step.
Summary
- RewardConfig dataclass and a configurable t_min crane reward, replacing hard-coded reward weights with YAML-driven config
- hybrid_cv01 with both continuous Box(-1,1) and discrete Discrete(3) action spaces across seeds 42 and 5775 (3 M steps each)
- reward_comparison.md -- a full analysis document covering training dynamics, speed sweep robustness, seed sensitivity, and action space comparison

Key finding
Switching to Discrete(3) (bang-bang control) drops final crane position from 0.64–1.46 cm (continuous) to machine-epsilon level (1e-14–1e-16 m) across all 100 evaluated speeds and both seeds. Mean settle time roughly halves (95→49 s for s5775, 113→48 s for s42). The discrete policy also eliminates the 4 non-converging episodes present in continuous s42 at ±9.0 and ±3.4 m/s.

Settle-step range is nearly identical across seeds (6–87 s disc s5775, 6–81 s disc s42), indicating the bang-bang solution is determined by the physics rather than the training trajectory.

Changes
- src/crane_controller/experiment_config.py -- RewardConfig + YAML sidecar
- src/crane_controller/envs/controlled_crane_pendulum.py -- t_min reward term, show_plot 7→6 panels
- src/crane_controller/callbacks.py -- mypy fix (None guard on ep_info_buffer)
- src/crane_controller/ppo_agent.py -- EpisodeResult extended; mypy/ruff fixes
- scripts/play_ppo.py -- speed sweep, CSV export, PNG saving; ruff fixes
- scripts/plot_sweep.py -- 3-panel → 6-panel comparison; ruff fixes
- scripts/plot_training.py -- solid/dashed multi-CSV overlay; ruff fixes
- models/ -- hybrid_cv01 cont + disc results for seeds 42 and 5775
- docs/source/reward_comparison.md -- full analysis (7 sections, 13 figures)
- docs/source/_static/ -- training curves, sweep plots, detail figures, episode gallery

Test plan
- uv run pytest: 43/43 passed
- uv run mypy: clean
- uv run ruff check + ruff format --check: clean
- uv run pyright: no new errors (pre-existing pygame-ce stubs issue unrelated to this branch)