feat: t_min reward, discrete action space, seed sensitivity analysis by aleksandarbabicdnv · Pull Request #14 · dnv-opensource/crane-controller

aleksandarbabicdnv · 2026-06-16T05:48:10Z

Summary

Introduces RewardConfig dataclass and a configurable t_min crane reward, replacing hard-coded reward weights with YAML-driven config
Trains hybrid_cv01 with both continuous Box(-1,1) and discrete Discrete(3) action spaces across seeds 42 and 5775 (3M steps each)
Adds reward_comparison.md — a full analysis document covering training dynamics, speed sweep robustness, seed sensitivity, and action space comparison
Adds episode trajectory gallery (§6.4) showing cont vs disc behaviour at 1.0, 5.0, and 9.0 m/s

Key finding

Switching to Discrete(3) (bang-bang control) drops final crane position from 0.64–1.46 cm (continuous) to machine-epsilon level (1e-14–1e-16 m) across all 100 evaluated speeds and both seeds. Mean settle time roughly halves (95→49 s for s5775, 113→48 s for s42). The discrete policy also eliminates the 4 non-converging episodes present in continuous s42 at ±9.0 and ±3.4 m/s.

The settle-step range is nearly identical across seeds (6–87 s for disc s5775, 6–81 s for disc s42), which suggests that the bang-bang solution is determined by the physics rather than by the training trajectory.
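The difference between the two action spaces can be illustrated with a minimal sketch. The continuous scaling `crane_acc = action[0] * acc` is taken from the commit text; the discrete index-to-acceleration mapping, the `MAX_ACC` value, and both function names are assumptions made for illustration and are not confirmed by this PR.

```python
from typing import Sequence

MAX_ACC = 1.0  # hypothetical maximum crane acceleration in m/s^2 (not from the PR)


def discrete_to_acc(action: int, acc: float = MAX_ACC) -> float:
    """Map a Discrete(3) action index to a bang-bang acceleration.

    Assumed mapping (not confirmed by the PR): 0 -> -acc, 1 -> 0, 2 -> +acc.
    """
    if action not in (0, 1, 2):
        raise ValueError(f"expected an action in {{0, 1, 2}}, got {action}")
    return float(action - 1) * acc


def continuous_to_acc(action: Sequence[float], acc: float = MAX_ACC) -> float:
    """Scale a normalized Box(-1, 1) action to a physical acceleration."""
    return action[0] * acc


# The discrete policy can only choose among three acceleration levels,
# one of which is exactly zero.
bang_bang_levels = [discrete_to_acc(a) for a in range(3)]

# The continuous policy can output any value in between.
partial_push = continuous_to_acc([0.25], acc=2.0)
```

Under this assumed mapping the discrete policy can command an exact zero acceleration, whereas the continuous policy has to output precisely 0.0 to coast.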

Changes

src/crane_controller/experiment_config.py — RewardConfig + YAML sidecar
src/crane_controller/envs/controlled_crane_pendulum.py — t_min reward term, show_plot 7→6 panels
src/crane_controller/callbacks.py — mypy fix (None guard on ep_info_buffer)
src/crane_controller/ppo_agent.py — EpisodeResult extended; mypy/ruff fixes
scripts/play_ppo.py — speed sweep, CSV export, PNG saving; ruff fixes
scripts/plot_sweep.py — 3-panel → 6-panel comparison; ruff fixes
scripts/plot_training.py — solid/dashed multi-CSV overlay; ruff fixes
models/ — hybrid_cv01 cont + disc results for seeds 42 and 5775
docs/source/reward_comparison.md — full analysis (7 sections, 13 figures)
docs/source/_static/ — training curves, sweep plots, detail figures, episode gallery

Test plan

uv run pytest — 43/43 passed
uv run mypy — clean
uv run ruff check + ruff format --check — clean
uv run pyright — no new errors (pre-existing pygame-ce stubs issue unrelated to this branch)

Environment (AntiPendulumEnv):
- obs[3] changed from absolute load x-velocity to pure angular velocity theta_dot = (cm_v[0] - origin_v[0]) / wire.length, independent of crane translation. Improves PPO value estimation quality. WARNING for Q-agent: existing Q-tables are invalid after this change (observation space semantics changed) -- retrain from scratch with continuous_actions=False explicitly passed to AntiPendulumEnv.
- Five new reward terms (all default 0.0, fully backward-compatible): angle, angular_velocity, crane_velocity, crane_acceleration, angular_acceleration.
- size parameter renamed to rail_limit (half-span of crane rail in metres).
- continuous_actions parameter (default False): when True, action space is Box([-1, 1], shape=(1,)) -- the standard normalized action convention in the Gymnasium/SB3 ecosystem. Policy networks end with tanh, so [-1, 1] is the natural output range with no clipping needed. The environment does the physical scaling: crane_acc = action[0] * acc (two-layer pattern: normalized signal -> physical quantity). All Q-agent code passes continuous_actions=False explicitly to preserve Discrete(3) backward compatibility.

Experiment config system (new: src/crane_controller/experiment_config.py):
- RewardConfig, TrainingConfig, ExperimentConfig frozen dataclasses replace the opaque reward_fac tuple, eliminating the silent index-swap bug class.
- YAML experiment files in experiments/:
  baseline.yaml -- original pre-refactor defaults for reference
  hybrid_cv01.yaml -- validated config: energy=1.0, crane_velocity=0.1, position=0.02, terminal_penalty=-5.0, randomize_start=True
  Seeds 42, 987, 5775 all achieve 6/6 OOD generalisation at start_speed=7.0 (7x the [0.1, 1.0] training range).
- JSON sidecar (*_meta.json) saved alongside every model; play_ppo.py reads it automatically so reward config and flags follow the model without manual CLI flags at playback time.
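The "silent index-swap bug class" mentioned for the reward_fac tuple can be illustrated with a small sketch. The class below is a hypothetical stand-in, not the actual RewardConfig: the field names and the hybrid_cv01 values come from the text above, while the class name, the defaults, and the tuple ordering are assumptions.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class RewardConfigSketch:
    """Hypothetical stand-in for RewardConfig; defaults here are assumed."""

    energy: float = 0.0
    crane_velocity: float = 0.0
    position: float = 0.0
    terminal_penalty: float = 0.0


# Values quoted for hybrid_cv01.yaml in the commit text.
hybrid_cv01 = RewardConfigSketch(
    energy=1.0, crane_velocity=0.1, position=0.02, terminal_penalty=-5.0
)

# With a positional tuple, swapping two weights goes unnoticed
# (assumed order: energy, crane_velocity, position).
reward_fac_swapped = (1.0, 0.02, 0.1)

# A frozen dataclass also rejects accidental mutation after construction.
try:
    hybrid_cv01.energy = 2.0  # type: ignore[misc]
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```

Keyword construction makes each weight self-describing at the call site, which is the property the tuple lacked.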
Scripts (train_ppo.py / play_ppo.py):
- New flags: --config PATH, --continuous-actions/--no-continuous-actions, --randomize-start/--no-randomize-start, --start-speed, --seed, --rail-limit, --resume-from.
- play_ppo.py pre-parses --model-path to auto-load sidecar defaults before argparse, so --continuous-actions/--randomize-start default to the values used during training.

Tests: 38 passing (16 new experiment_config tests; parametrized discrete/continuous action-space and step tests).

README: updated obs description, action space docs, experiment config section, pre-trained model table and play commands.
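The pre-parse step described for play_ppo.py could look roughly like the following two-stage argparse sketch. It is not the script's actual code: the helper names, the JSON keys, and the sidecar file naming (`<stem>_meta.json`) are assumptions derived from the `*_meta.json` pattern and flag names in the text.

```python
import argparse
import json
import tempfile
from pathlib import Path


def sidecar_path(model_path: Path) -> Path:
    """Assumed naming: foo.zip -> foo_meta.json."""
    return model_path.with_name(model_path.stem + "_meta.json")


def parse_args(argv: list[str]) -> argparse.Namespace:
    # Stage 1: read only --model-path and ignore every other argument.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--model-path", type=Path, required=True)
    known, _ = pre.parse_known_args(argv)

    # Use sidecar values as defaults when the file exists.
    defaults = {"continuous_actions": False, "randomize_start": False}
    meta = sidecar_path(known.model_path)
    if meta.exists():
        stored = json.loads(meta.read_text())
        defaults.update({k: stored[k] for k in defaults if k in stored})

    # Stage 2: full parser; explicit CLI flags still override the sidecar.
    parser = argparse.ArgumentParser(parents=[pre])
    parser.add_argument("--continuous-actions", action=argparse.BooleanOptionalAction)
    parser.add_argument("--randomize-start", action=argparse.BooleanOptionalAction)
    parser.set_defaults(**defaults)
    return parser.parse_args(argv)


# Demonstration with a temporary sidecar; the model file itself is not needed.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "demo_model.zip"
    sidecar_path(model).write_text(
        json.dumps({"continuous_actions": True, "randomize_start": True})
    )
    from_sidecar = parse_args(["--model-path", str(model)])
    overridden = parse_args(["--model-path", str(model), "--no-continuous-actions"])
```

`argparse.BooleanOptionalAction` (Python 3.9+) generates the paired `--flag/--no-flag` options, and `set_defaults` lets the sidecar supply defaults without hiding the flags from the command line.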

…ntinuous)

Trained with experiments/hybrid_cv01.yaml: energy=1.0, crane_velocity=0.1, position=0.02, terminal_penalty=-5.0, randomize_start=True, 3M steps, 32 parallel envs. All six models achieve 6/6 OOD generalisation at start_speed=7.0 (7x the [0.1, 1.0] training range).

Each model bundle (3 files):
  .zip -- policy network weights (SB3 PPO)
  _vecnorm.pkl -- observation normalisation statistics (required for correct inference; must match the .zip it was saved with)
  _meta.json -- reward config + flags (read automatically by play_ppo.py)

Play any model:
  uv run python scripts/play_ppo.py \
      --model-path models/ppo_hcv01_discrete_s42.zip \
      --episodes 3 --render-mode plot
  uv run python scripts/play_ppo.py \
      --model-path models/ppo_hcv01_continuous_s42.zip \
      --episodes 3 --render-mode plot

OOD test (7x training range):
  uv run python scripts/play_ppo.py \
      --model-path models/ppo_hcv01_discrete_s42.zip \
      --episodes 6 --render-mode plot --randomize-start --start-speed 7.0

Seeds: 42, 987, 5775 -- discrete and continuous variants of each (~145 KB per .zip).

Seed drawn from random.org (blind, before seeing results) to avoid cherry-picking. Both discrete and continuous variants achieve 6/6 OOD generalisation at start_speed=7.0 (7x the [0.1, 1.0] training range), consistent with seeds 42, 987, and 5775.

Models (6 files, trained with experiments/hybrid_cv01.yaml, 3M steps):
  models/ppo_hcv01_{discrete,continuous}_s827596.{zip,_vecnorm.pkl,_meta.json}

README: added 4 representative OOD episode figures (2 discrete, 2 continuous) under the pre-trained models section to illustrate the crane damping behaviour at 7x training range. Figures committed to assets/.

- experiment_config.py: from_dict parameter type dict[str,object] -> Mapping[str,object] (dict is invariant; Mapping is covariant, so dict[str,float] callers are accepted). Mapping moved into TYPE_CHECKING block per ruff TC003. Unused write_text() return assigned to _.
- test_experiment_config.py: assign unused call results to _; suppress reportPrivateUsage on _meta_path import (internal utility, tested by design).
- controlled_crane_pendulum.py: suppress reportUnknownMemberType on self.reward usages (type flows from py_crane which has incomplete stubs -- pre-existing).
- test_environment.py: suppress reportUnknownMemberType on wire.cm_v and Discrete.n accesses (py_crane / gymnasium stubs).

Remaining pyright issues (17 errors, 38 warnings) are all in controlled_mobile_crane.py (pygame stubs, pre-existing) and informational unnecessary-type-ignore comments in ppo_agent.py / test_ppo.py (pre-existing).
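The from_dict typing change above rests on a variance rule: `dict[str, float]` is not assignable to `dict[str, object]` because `dict` is invariant in its value type, while `Mapping[str, object]` accepts it because `Mapping` is covariant in its value type. A minimal sketch, using a hypothetical `RewardWeights` class rather than the real RewardConfig:

```python
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical config class used only to illustrate the typing point."""

    energy: float = 0.0
    position: float = 0.0

    @classmethod
    def from_dict(cls, data: Mapping[str, object]) -> "RewardWeights":
        # Keep only known numeric fields; unknown keys are ignored in this sketch.
        known = {
            key: float(value)
            for key, value in data.items()
            if key in cls.__dataclass_fields__ and isinstance(value, (int, float))
        }
        return cls(**known)


# A dict[str, float] caller: accepted by a type checker because the
# parameter is a Mapping, which is read-only and therefore covariant.
weights: dict[str, float] = {"energy": 1.0, "position": 0.02}
cfg = RewardWeights.from_dict(weights)
```

Had the parameter been annotated `dict[str, object]`, a type checker would reject the call, since the callee could then legally insert a non-float value into the caller's dictionary.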

…ypy)

…-panel plots

- AntiPendulumEnv: new t_min_crane reward term (-t_min weight), _get_info() returns t_min/x_pos/x_vel, show_plot() extended to 7 panels (angle, speed, crane pos/vel, rewards, acceleration, t_min), render() accepts save_path
- RewardConfig: add t_min_crane, crane_velocity, angle, angular_velocity, crane_acceleration, angular_acceleration fields
- TrainingConfig: add success_threshold field
- EpRewardLogCallback: new module (callbacks.py); 3-bucket classification rail_hit/timelimit/success per interval; 13-column CSV log
- ppo_agent: thread success_threshold through __init__/load/resume/do_training; do_one_episode() tracks t_min_trace and saves 7-panel PNG via render(save_path=)
- play_ppo.py: --save-png/--no-save-png flag; PNG named {stem}_play_ss{speed}_ep{n}.png
- train_ppo.py: --success-threshold CLI arg
- experiments: add sig_t_min.yaml and hybrid_t_min.yaml; update hybrid_cv01.yaml
- models: add hybrid_cv01_s42 and sig_t_min_s42 training logs (CSV)
- tests: add t_min_crane reward term test; add success_threshold assertions

- ppo_agent: EpisodeResult dataclass (start_speed, ep_steps, ep_reward, terminated, truncated, success, t_min stats, x_pos/vel_final); do_one_episode() returns EpisodeResult; success_threshold param defaults to agent's training threshold
- play_ppo.py: --speed-sweep (runs [0.5,1.0,2.0,3.0,5.0,7.0,10.0]); --save-csv/--no-save-csv (default True, writes {stem}_play_results.csv); per-episode one-line summary logged to CLI

- do_one_episode(): read t_min_start from env.reset() info dict instead of t_min_trace[0] (which was post-step-1, not the true initial state)
- play_ppo.py: print per-speed summary table after --speed-sweep (speed, n, success%, rew_mean, rew_std, t_min_final_mean)
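The off-by-one described for t_min_start can be reproduced with a toy environment. `ToyEnv` below is a hypothetical stand-in that only mimics the Gymnasium return shapes (`reset()` returns `(obs, info)`, `step()` returns `(obs, reward, terminated, truncated, info)`); its dynamics and the `"t_min"` info key are assumptions for illustration.

```python
class ToyEnv:
    """Hypothetical environment: t_min starts at 10 and drops by 1 per step."""

    def __init__(self) -> None:
        self._t_min = 10.0

    def reset(self):
        self._t_min = 10.0
        return [0.0], {"t_min": self._t_min}

    def step(self, action):
        self._t_min -= 1.0
        terminated = self._t_min <= 7.0
        return [0.0], 0.0, terminated, False, {"t_min": self._t_min}


env = ToyEnv()

# The info dict from reset() describes the true initial state.
_, info = env.reset()
t_min_start = info["t_min"]

# The trace only collects values returned by step(), so its first entry
# already reflects the state after one action.
t_min_trace = []
done = False
while not done:
    _, _, terminated, truncated, info = env.step(0)
    t_min_trace.append(info["t_min"])
    done = terminated or truncated
```

Here `t_min_start` is 10.0 while `t_min_trace[0]` is 9.0, which is the discrepancy the fix removes.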

Records the first step at which t_min reaches its episode minimum, indicating when the active damping phase ends and the crane settles.

… at plateau

Previous definition (argmin) captured transient dips, which gave misleading results (e.g. speed=2.0 showed step 9 due to an early overshoot). New definition: last step where t_min > t_min_final + 0.05, so settle_step reflects when active damping truly ends rather than the earliest low t_min.
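The two settle-step definitions can be compared side by side. This is a sketch under stated assumptions: the function names, the zero-based step index, the `None` return for a trace that never exceeds the threshold, and the example trace are all illustrative; only the rule "last step where t_min > t_min_final + 0.05" comes from the commit text.

```python
from typing import Optional, Sequence


def settle_step_argmin(t_min_trace: Sequence[float]) -> int:
    """Old definition: first step at which t_min reaches its episode minimum."""
    return min(range(len(t_min_trace)), key=t_min_trace.__getitem__)


def settle_step_plateau(t_min_trace: Sequence[float], tol: float = 0.05) -> Optional[int]:
    """New definition: last step where t_min > t_min_final + tol.

    Returns None if the trace never exceeds the threshold (an assumption;
    the PR does not say how this case is handled).
    """
    threshold = t_min_trace[-1] + tol
    last_above = None
    for step, value in enumerate(t_min_trace):
        if value > threshold:
            last_above = step
    return last_above


# Illustrative trace with an early transient dip at step 2.
trace = [3.0, 2.0, 0.1, 1.5, 1.0, 0.3, 0.2, 0.2]
old_step = settle_step_argmin(trace)   # picks the transient dip
new_step = settle_step_plateau(trace)  # picks the end of active damping
```

On this trace the argmin rule reports step 2 because of the early dip, while the plateau rule reports step 5, the last step before t_min stays within 0.05 of its final value.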

…weep table

- no_crash = not terminated (rail stay); success = reward threshold (scale-dependent)
- theta_final, theta_dot_final read from final obs[2], obs[3]
- sweep table now shows nocrash%, succ%, settle_step, t_min_final, theta_final, thdot_final
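The distinction between no_crash and success can be sketched as follows. The `Episode` class, the `>=` comparison, the bucket precedence, and the example numbers are assumptions; the bucket names rail_hit/timelimit/success and the rule "no_crash = not terminated" come from the commit messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical minimal episode record."""

    terminated: bool  # True when the crane hit the rail limit
    ep_reward: float


def bucket(ep: Episode, success_threshold: float) -> str:
    """Assign one of three buckets; the precedence order is an assumption."""
    if ep.terminated:
        return "rail_hit"
    if ep.ep_reward >= success_threshold:
        return "success"
    return "timelimit"


def percent(flags: list[bool]) -> float:
    return 100.0 * sum(flags) / len(flags)


episodes = [
    Episode(terminated=False, ep_reward=-10.0),
    Episode(terminated=False, ep_reward=-80.0),
    Episode(terminated=True, ep_reward=-200.0),
    Episode(terminated=False, ep_reward=-20.0),
]
threshold = -50.0  # illustrative; depends on the reward weights in use

buckets = [bucket(ep, threshold) for ep in episodes]
nocrash_pct = percent([not ep.terminated for ep in episodes])
succ_pct = percent([b == "success" for b in buckets])
```

The no-crash rate depends only on the termination flag, whereas the success rate moves whenever the reward weights change the scale of `ep_reward`, which is why the text calls it scale-dependent.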

Completes the multi-dimensional end-state quality vector: x_pos, x_vel, theta, theta_dot, t_min (existing) + pendulum KE and crane acceleration at last step (new). Sweep table drops succ% (scale-dependent) and adds x_pos_f, energy_f, acc_f as envelope metrics.

…rop t_min_f)

x_pos shown in metres (4 d.p.), energy normalised to fraction of initial KE (0=perfect, comparable across speeds), t_min_f dropped from terminal output (stays in CSV). succ% dropped (scale-dependent, replaced by envelope metrics).
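Why a fraction of initial kinetic energy is comparable across start speeds can be shown with a short calculation. The function names, the point-mass formula KE = ½·m·v², and the sample speeds are assumptions for illustration; the PR does not state how the pendulum KE is computed.

```python
import math


def kinetic_energy(mass: float, speed: float) -> float:
    """Point-mass kinetic energy, 0.5 * m * v^2 (assumed model)."""
    return 0.5 * mass * speed**2


def energy_fraction(ke_final: float, ke_initial: float) -> float:
    """Residual energy as a fraction of the initial energy (0 = fully damped)."""
    if ke_initial <= 0.0:
        raise ValueError("initial kinetic energy must be positive")
    return ke_final / ke_initial


# Two episodes that each damp the load speed to 1% of its start value.
frac_slow = energy_fraction(kinetic_energy(1.0, 0.01), kinetic_energy(1.0, 1.0))
frac_fast = energy_fraction(kinetic_energy(1.0, 0.07), kinetic_energy(1.0, 7.0))

# The absolute residual energies differ by a factor of 49,
# but the normalised fractions agree.
comparable = math.isclose(frac_slow, frac_fast)
```

Both fractions are about 1e-4, so the metric rates the two episodes as equally well damped even though their absolute residual energies differ.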

…l_f, rename acc_f→x_acc_f

rew_mean divided by ep_steps; energy_frac first quality column; x_vel_f added; acc_f renamed x_acc_f for consistency with x_pos_m/x_vel_f naming.

These continuous and discrete model checkpoints (seeds 42, 987, 5775, 827596) were trained under the old naming convention and replaced by the hybrid_cv01 and sig_t_min runs on this branch.

…tivity section

- rename reward_comparison_s5775.md → reward_comparison.md
- update index.rst reference
- add §5 seed-sensitivity section: hybrid_cv01_s42 (crash-free, 1.46 cm attractor) and sig_t_min_s42 (crash band at +6.6–+7.8 m/s, 93% crash-free)
- expand §6.1 summary table to 4 columns covering both seeds
- commit s42 sweep CSVs and three new detail/comparison plots

…splay

plot_sweep.py _plot_comparison: 3-panel → 2×3 grid, adding θ, θ_dot and acc columns so physics state is fully visible alongside reward signal.
plot_training.py: solid/dashed linestyles for multi-CSV overlay.
AntiPendulumEnv.show_plot: drop t_min panel, since reward internals should not appear in physics output (7 → 6 panels).

…5775)

Training logs and play results for the discrete action space variant, needed for the cont-vs-disc comparison in reward_comparison.md.

Previous s42 log was missing physical metric columns (x_pos, theta, settle_step), preventing apples-to-apples comparison with the discrete variant. Retraining with the updated callback adds these columns. sig_t_min models are superseded by hybrid_cv01 with RewardConfig.

…space analysis

Reorganise from a monolithic discrete section to cont-vs-disc side-by-side per seed. Add §2 simulation environment (physical params, PPO hyperparameters from hybrid_cv01.yaml).

Key finding: Discrete(3) reaches machine-epsilon position (1e-14 to 1e-16 m) vs ~1 cm for Box(-1,1); settle range nearly identical across seeds, suggesting a physics-determined bang-bang sequence.

Update and add figures for s42 and s5775; remove sig_t_min figures.

Five time-series plots, each showing position, velocity, θ, θ_dot, acc, and reward, for seed 42: the disc 1.0 m/s baseline, cont vs disc at 5.0 m/s, and the non-converging cont failure at 9.0 m/s, which disc resolves at the same speed.

…cripts

callbacks.py: guard ep_info_buffer None check before len() (mypy arg-type).
ppo_agent.py: extend type: ignore to cover call-arg on render(save_path=).
play_ppo.py: extract _log_sweep_table to reduce main() statement count (PLR0915); zip strict=True; ax.grid(visible=True).
plot_sweep.py: zip strict=True; ax.grid(visible=True); removesuffix(); _ZERO_SPEED_THRESHOLD constant; docstring on main().
plot_training.py: zip strict=True; ax.grid(visible=True); removesuffix().

eisDNV

Good work. We start from here and align other branches with the result in the next step.

aleksandarbabicdnv and others added 28 commits June 9, 2026 20:00

docs: add ASCII physics diagram to AntiPendulumEnv description

11927a5

fix: add call-overload to type: ignore on int(Mapping.get()) calls (mypy)

4a04c63

feat: add t_min_settle_step to EpisodeResult and sweep summary table

2c58f21

refactor: reshape sweep table — rew/step, energy_frac first, add x_ve…

503f7a7

…l_f, rename acc_f→x_acc_f

refactor: rename sweep table column settle → settle_step

3daf2cd

chore: remove superseded ppo_hcv01 model files

dececeb

refactor: add RewardConfig, t_min reward, update training pipeline

f66cdbe

feat: seed 5775 analysis -- reward_comparison doc, figures, sweep data

148f10a

feat: add Discrete(3) training results for hybrid_cv01 (seeds 42 and 5775)

c89730e

docs: add episode trajectory gallery to reward_comparison (§6.4)

1481f6e

style: add blank line after inline import (ruff format)

e3f37a6

docs: sync README model table; drop stale OOD eval figures

86913ae

aleksandarbabicdnv requested a review from eisDNV June 16, 2026 07:34

eisDNV approved these changes Jun 16, 2026

eisDNV merged commit b2835cb into main Jun 16, 2026
10 checks passed

eisDNV deleted the feat/t-min-reward branch June 16, 2026 08:15
