feat: t_min reward, discrete action space, seed sensitivity analysis #14

Merged
eisDNV merged 28 commits into main from feat/t-min-reward on Jun 16, 2026

@aleksandarbabicdnv

Collaborator

Summary

  • Introduces RewardConfig dataclass and a configurable t_min crane reward, replacing hard-coded reward weights with YAML-driven config
  • Trains hybrid_cv01 with both continuous Box(-1,1) and discrete Discrete(3) action spaces across seeds 42 and 5775 (3 M steps each)
  • Adds reward_comparison.md — a full analysis document covering training dynamics, speed sweep robustness, seed sensitivity, and action space comparison
  • Adds episode trajectory gallery (§6.4) showing cont vs disc behaviour at 1.0, 5.0, and 9.0 m/s

Key finding

Switching to Discrete(3) (bang-bang control) drops final crane position from 0.64–1.46 cm (continuous) to machine-epsilon level (1e-14–1e-16 m) across all 100 evaluated speeds and both seeds. Mean settle time roughly halves (95→49 s for s5775, 113→48 s for s42). The discrete policy also eliminates the 4 non-converging episodes present in continuous s42 at ±9.0 and ±3.4 m/s.

Settle-step range is nearly identical across seeds (6–87 s disc s5775, 6–81 s disc s42), indicating the bang-bang solution is determined by the physics rather than the training trajectory.

Changes

  • src/crane_controller/experiment_config.py: RewardConfig + YAML sidecar
  • src/crane_controller/envs/controlled_crane_pendulum.py: t_min reward term, show_plot 7→6 panels
  • src/crane_controller/callbacks.py: mypy fix (None guard on ep_info_buffer)
  • src/crane_controller/ppo_agent.py: EpisodeResult extended; mypy/ruff fixes
  • scripts/play_ppo.py: speed sweep, CSV export, PNG saving; ruff fixes
  • scripts/plot_sweep.py: 3-panel → 6-panel comparison; ruff fixes
  • scripts/plot_training.py: solid/dashed multi-CSV overlay; ruff fixes
  • models/: hybrid_cv01 cont + disc results for seeds 42 and 5775
  • docs/source/reward_comparison.md: full analysis (7 sections, 13 figures)
  • docs/source/_static/: training curves, sweep plots, detail figures, episode gallery

Test plan

  • uv run pytest — 43/43 passed
  • uv run mypy — clean
  • uv run ruff check + ruff format --check — clean
  • uv run pyright — no new errors (pre-existing pygame-ce stubs issue unrelated to this branch)

aleksandarbabicdnv and others added 28 commits June 9, 2026 20:00
Environment (AntiPendulumEnv):
- obs[3] changed from absolute load x-velocity to pure angular velocity
  theta_dot = (cm_v[0] - origin_v[0]) / wire.length, independent of crane
  translation. Improves PPO value estimation quality.
  WARNING for Q-agent: existing Q-tables are invalid after this change
  (observation space semantics changed) -- retrain from scratch with
  continuous_actions=False explicitly passed to AntiPendulumEnv.
- Five new reward terms (all default 0.0, fully backward-compatible):
  angle, angular_velocity, crane_velocity, crane_acceleration,
  angular_acceleration.
- size parameter renamed to rail_limit (half-span of crane rail in metres).
- continuous_actions parameter (default False): when True, action space is
  Box([-1, 1], shape=(1,)) -- the standard normalized action convention in
  the Gymnasium/SB3 ecosystem. Policy networks end with tanh, so [-1, 1] is
  the natural output range with no clipping needed. The environment does the
  physical scaling: crane_acc = action[0] * acc (two-layer pattern: normalized
  signal -> physical quantity). All Q-agent code passes continuous_actions=False
  explicitly to preserve Discrete(3) backward compatibility.

Experiment config system (new: src/crane_controller/experiment_config.py):
- RewardConfig, TrainingConfig, ExperimentConfig frozen dataclasses replace
  the opaque reward_fac tuple, eliminating the silent index-swap bug class.
- YAML experiment files in experiments/:
    baseline.yaml     -- original pre-refactor defaults for reference
    hybrid_cv01.yaml  -- validated config: energy=1.0, crane_velocity=0.1,
                         position=0.02, terminal_penalty=-5.0,
                         randomize_start=True
  Seeds 42, 987, 5775 all achieve 6/6 OOD generalisation at start_speed=7.0
  (7x the [0.1, 1.0] training range).
- JSON sidecar (*_meta.json) saved alongside every model; play_ppo.py reads
  it automatically so reward config and flags follow the model without manual
  CLI flags at playback time.
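
A minimal sketch of why named, frozen fields remove the index-swap bug class, and how a JSON sidecar can sit next to a model file. Only the field names and the `_meta.json` suffix come from the text above; the defaults, the validation rules, the `meta_path` and `save_sidecar` helpers, and the sidecar layout are assumptions for illustration and may differ from experiment_config.py.

```python
import json
from collections.abc import Mapping
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class RewardConfig:
    # Field names follow the commit text; the defaults are placeholders.
    energy: float = 0.0
    crane_velocity: float = 0.0
    position: float = 0.0
    terminal_penalty: float = 0.0

    @classmethod
    def from_dict(cls, data: Mapping[str, object]) -> "RewardConfig":
        known = {f.name for f in fields(cls)}
        values: dict[str, float] = {}
        for key, value in data.items():
            if key not in known:
                raise ValueError(f"unknown reward field: {key!r}")
            if not isinstance(value, (int, float)):
                raise TypeError(f"{key!r} must be numeric")
            values[key] = float(value)
        return cls(**values)


def meta_path(model_path: Path) -> Path:
    """Hypothetical naming rule: model.zip -> model_meta.json."""
    return model_path.with_name(model_path.stem + "_meta.json")


def save_sidecar(model_path: Path, reward: RewardConfig) -> Path:
    """Write the reward config next to the model (assumed JSON layout)."""
    path = meta_path(model_path)
    path.write_text(json.dumps({"reward": asdict(reward)}, indent=2))
    return path
```

With a positional tuple, swapping two weights goes unnoticed; with keyword fields, a misspelled or misplaced name fails loudly at load time.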

Scripts (train_ppo.py / play_ppo.py):
- New flags: --config PATH, --continuous-actions/--no-continuous-actions,
  --randomize-start/--no-randomize-start, --start-speed, --seed,
  --rail-limit, --resume-from.
- play_ppo.py pre-parses --model-path to auto-load sidecar defaults before
  argparse, so --continuous-actions/--randomize-start default to the values
  used during training.
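
One way to pre-parse a single flag before the main parser is `argparse.ArgumentParser.parse_known_args`. The sketch below assumes a sidecar named `<stem>_meta.json` with a top-level `continuous_actions` key; both the key name and the helper names `sidecar_defaults` and `build_parser` are hypothetical and not taken from play_ppo.py.

```python
import argparse
import json
from pathlib import Path


def sidecar_defaults(argv: list[str]) -> dict[str, object]:
    """Pre-parse only --model-path and return defaults from its sidecar, if any."""
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--model-path", type=Path, default=None)
    known, _unknown = pre.parse_known_args(argv)  # ignores all other flags
    if known.model_path is None:
        return {}
    meta = known.model_path.with_name(known.model_path.stem + "_meta.json")
    if not meta.exists():
        return {}
    return json.loads(meta.read_text())


def build_parser(defaults: dict[str, object]) -> argparse.ArgumentParser:
    """Main parser whose flag default comes from the sidecar (ASSUMED key name)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=Path)
    parser.add_argument(
        "--continuous-actions",
        action=argparse.BooleanOptionalAction,  # also adds --no-continuous-actions
        default=bool(defaults.get("continuous_actions", False)),
    )
    return parser
```

An explicit `--no-continuous-actions` on the command line still overrides the sidecar value, because the sidecar only changes the default.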

Tests: 38 passing (16 new experiment_config tests; parametrized
discrete/continuous action-space and step tests).

README: updated obs description, action space docs, experiment config
section, pre-trained model table and play commands.
…ntinuous)

Trained with experiments/hybrid_cv01.yaml:
  energy=1.0, crane_velocity=0.1, position=0.02,
  terminal_penalty=-5.0, randomize_start=True,
  3M steps, 32 parallel envs.

All six models achieve 6/6 OOD generalisation at start_speed=7.0
(7x the [0.1, 1.0] training range).

Each model bundle (3 files):
  .zip          -- policy network weights (SB3 PPO)
  _vecnorm.pkl  -- observation normalisation statistics (required for
                   correct inference; must match the .zip it was saved with)
  _meta.json    -- reward config + flags (read automatically by play_ppo.py)

Play any model:
  uv run python scripts/play_ppo.py \
      --model-path models/ppo_hcv01_discrete_s42.zip \
      --episodes 3 --render-mode plot

  uv run python scripts/play_ppo.py \
      --model-path models/ppo_hcv01_continuous_s42.zip \
      --episodes 3 --render-mode plot

OOD test (7x training range):
  uv run python scripts/play_ppo.py \
      --model-path models/ppo_hcv01_discrete_s42.zip \
      --episodes 6 --render-mode plot --randomize-start --start-speed 7.0

Seeds: 42, 987, 5775 -- discrete and continuous variants of each (~145 KB per .zip).
Seed drawn from random.org (blind, before seeing results) to avoid
cherry-picking. Both discrete and continuous variants achieve 6/6 OOD
generalisation at start_speed=7.0 (7x the [0.1, 1.0] training range),
consistent with seeds 42, 987, and 5775.

Models (6 files, trained with experiments/hybrid_cv01.yaml, 3M steps):
  models/ppo_hcv01_{discrete,continuous}_s827596.{zip,_vecnorm.pkl,_meta.json}

README: added 4 representative OOD episode figures (2 discrete, 2 continuous)
under the pre-trained models section to illustrate the crane damping behaviour
at 7x training range. Figures committed to assets/.
- experiment_config.py: from_dict parameter type dict[str,object] -> Mapping[str,object]
  (dict is invariant; Mapping is covariant, so dict[str,float] callers are accepted).
  Mapping moved into TYPE_CHECKING block per ruff TC003.
  Unused write_text() return assigned to _.
- test_experiment_config.py: assign unused call results to _; suppress
  reportPrivateUsage on _meta_path import (internal utility, tested by design).
- controlled_crane_pendulum.py: suppress reportUnknownMemberType on self.reward
  usages (type flows from py_crane which has incomplete stubs -- pre-existing).
- test_environment.py: suppress reportUnknownMemberType on wire.cm_v and
  Discrete.n accesses (py_crane / gymnasium stubs).
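
The `dict` to `Mapping` change can be illustrated with a small sketch. The function `total_weight` and the example weights are hypothetical; only the typing behaviour (invariance of `dict`, covariance of `Mapping` in its value type) is the point.

```python
from collections.abc import Mapping


def total_weight(weights: Mapping[str, object]) -> float:
    """Sum the numeric values of any read-only mapping with str keys."""
    return sum(float(v) for v in weights.values() if isinstance(v, (int, float)))


reward_weights: dict[str, float] = {"energy": 1.0, "crane_velocity": 0.1}

# Accepted by type checkers: Mapping is covariant in its value type, so a
# dict[str, float] is a valid Mapping[str, object].
# With a `dict[str, object]` parameter the same call is rejected, because the
# callee could legally store a non-float value in the caller's dict.
print(total_weight(reward_weights))
```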

Remaining pyright issues (17 errors, 38 warnings) are all in
controlled_mobile_crane.py (pygame stubs, pre-existing) and informational
unnecessary-type-ignore comments in ppo_agent.py / test_ppo.py (pre-existing).
…-panel plots

- AntiPendulumEnv: new t_min_crane reward term (-t_min weight), _get_info() returns
  t_min/x_pos/x_vel, show_plot() extended to 7 panels (angle, speed, crane pos/vel,
  rewards, acceleration, t_min), render() accepts save_path
- RewardConfig: add t_min_crane, crane_velocity, angle, angular_velocity,
  crane_acceleration, angular_acceleration fields
- TrainingConfig: add success_threshold field
- EpRewardLogCallback: new module (callbacks.py); 3-bucket classification
  rail_hit/timelimit/success per interval; 13-column CSV log
- ppo_agent: thread success_threshold through __init__/load/resume/do_training;
  do_one_episode() tracks t_min_trace and saves 7-panel PNG via render(save_path=)
- play_ppo.py: --save-png/--no-save-png flag; PNG named {stem}_play_ss{speed}_ep{n}.png
- train_ppo.py: --success-threshold CLI arg
- experiments: add sig_t_min.yaml and hybrid_t_min.yaml; update hybrid_cv01.yaml
- models: add hybrid_cv01_s42 and sig_t_min_s42 training logs (CSV)
- tests: add t_min_crane reward term test; add success_threshold assertions
- ppo_agent: EpisodeResult dataclass (start_speed, ep_steps, ep_reward,
  terminated, truncated, success, t_min stats, x_pos/vel_final);
  do_one_episode() returns EpisodeResult; success_threshold param defaults
  to agent's training threshold
- play_ppo.py: --speed-sweep (runs [0.5,1.0,2.0,3.0,5.0,7.0,10.0]);
  --save-csv/--no-save-csv (default True, writes {stem}_play_results.csv);
  per-episode one-line summary logged to CLI
- do_one_episode(): read t_min_start from env.reset() info dict instead of
  t_min_trace[0] (which was post-step-1, not the true initial state)
- play_ppo.py: print per-speed summary table after --speed-sweep
  (speed, n, success%, rew_mean, rew_std, t_min_final_mean)
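
A per-speed summary of this kind could be computed as sketched below. `Episode` is a hypothetical, trimmed-down stand-in for `EpisodeResult`, and the use of the population standard deviation is an assumption; the column names follow the list above.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Episode:  # hypothetical subset of EpisodeResult
    start_speed: float
    ep_reward: float
    success: bool
    t_min_final: float


def summarize(episodes: list[Episode]) -> list[dict[str, float]]:
    """Group episodes by start speed and compute one summary row per speed."""
    by_speed: dict[float, list[Episode]] = defaultdict(list)
    for ep in episodes:
        by_speed[ep.start_speed].append(ep)

    rows = []
    for speed in sorted(by_speed):
        group = by_speed[speed]
        rewards = [ep.ep_reward for ep in group]
        rows.append(
            {
                "speed": speed,
                "n": len(group),
                "success%": 100.0 * sum(ep.success for ep in group) / len(group),
                "rew_mean": mean(rewards),
                "rew_std": pstdev(rewards),  # ASSUMED: population std
                "t_min_final_mean": mean(ep.t_min_final for ep in group),
            }
        )
    return rows
```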
Records the first step at which t_min reaches its episode minimum,
indicating when the active damping phase ends and the crane settles.
… at plateau

Previous definition (argmin) captured transient dips, which gave misleading
results (e.g. speed=2.0 showed step 9 due to an early overshoot).
New definition: last step where t_min > t_min_final + 0.05, so settle_step
reflects when active damping truly ends rather than the earliest low t_min.
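
The two definitions can be compared on a synthetic trace. This is a sketch under stated assumptions: the function names are hypothetical, the trace is assumed to be a plain sequence of per-step t_min values, and returning -1 when no step leaves the tolerance band is a convention chosen here, not one taken from the code.

```python
from collections.abc import Sequence


def settle_step_argmin(t_min_trace: Sequence[float]) -> int:
    """Previous definition: first step at which t_min reaches its minimum."""
    return min(range(len(t_min_trace)), key=t_min_trace.__getitem__)


def settle_step_plateau(t_min_trace: Sequence[float], tol: float = 0.05) -> int:
    """New definition: last step where t_min > t_min_final + tol.

    Returns -1 if no step exceeds the band (ASSUMED convention).
    """
    t_min_final = t_min_trace[-1]
    last = -1
    for step, value in enumerate(t_min_trace):
        if value > t_min_final + tol:
            last = step
    return last


# Synthetic trace with an early transient dip at step 2.
trace = [5.0, 3.0, 0.1, 2.0, 1.0, 0.3, 0.3]
print(settle_step_argmin(trace), settle_step_plateau(trace))
```

On this trace the argmin definition latches onto the transient dip, while the plateau definition reports the last step before t_min stays within 0.05 of its final value.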
…weep table

- no_crash = not terminated (crane stays on the rail); success = reward threshold met (scale-dependent)
- theta_final, theta_dot_final read from final obs[2], obs[3]
- sweep table now shows nocrash%, succ%, settle_step, t_min_final, theta_final, thdot_final
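
The episode flags above, together with the rail_hit/timelimit/success buckets of the training callback, suggest a classification along these lines. The function name, the precedence of the checks, and the `>=` comparison are assumptions for illustration, not the callback's actual logic.

```python
def classify_episode(terminated: bool, ep_reward: float, success_threshold: float) -> str:
    """Assign an episode to one of three buckets (hypothetical helper)."""
    if terminated:
        # Termination is taken to mean the crane hit the rail limit.
        return "rail_hit"
    if ep_reward >= success_threshold:  # ASSUMED comparison
        return "success"
    # Not terminated and below threshold: the episode ran out of time.
    return "timelimit"


def no_crash(terminated: bool) -> bool:
    """no_crash as defined above: the episode was not terminated."""
    return not terminated
```

Because the reward threshold depends on the reward scale of each config, `no_crash` is the flag that stays comparable across reward variants.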
Completes the multi-dimensional end-state quality vector:
x_pos, x_vel, theta, theta_dot, t_min (existing) + pendulum KE and
crane acceleration at last step (new). Sweep table drops succ% (scale-
dependent) and adds x_pos_f, energy_f, acc_f as envelope metrics.
…rop t_min_f)

x_pos shown in metres (4 d.p.), energy normalised to fraction of initial KE
(0=perfect, comparable across speeds), t_min_f dropped from terminal output
(stays in CSV). succ% dropped (scale-dependent, replaced by envelope metrics).
…l_f, rename acc_f→x_acc_f

rew_mean divided by ep_steps; energy_frac first quality column; x_vel_f added;
acc_f renamed x_acc_f for consistency with x_pos_m/x_vel_f naming.
These continuous and discrete model checkpoints (seeds 42, 987, 5775, 827596)
were trained under the old naming convention and replaced by the hybrid_cv01
and sig_t_min runs on this branch.
…tivity section

- rename reward_comparison_s5775.md → reward_comparison.md
- update index.rst reference
- add §5 seed-sensitivity section: hybrid_cv01_s42 (crash-free, 1.46 cm attractor) and
  sig_t_min_s42 (crash band at +6.6–+7.8 m/s, 93% crash-free)
- expand §6.1 summary table to 4 columns covering both seeds
- commit s42 sweep CSVs and three new detail/comparison plots
…splay

plot_sweep.py _plot_comparison: 3-panel → 2×3 grid, adding θ, θ_dot and
acc columns so physics state is fully visible alongside reward signal.
plot_training.py: solid/dashed linestyles for multi-CSV overlay.
AntiPendulumEnv.show_plot: drop t_min panel — reward internals should not
appear in physics output (7 → 6 panels).
…5775)

Training logs and play results for the discrete action space variant,
needed for the cont-vs-disc comparison in reward_comparison.md.
Previous s42 log was missing physical metric columns (x_pos, theta,
settle_step), preventing apples-to-apples comparison with the discrete
variant. Retraining with the updated callback adds these columns.
sig_t_min models are superseded by hybrid_cv01 with RewardConfig.
…space analysis

Reorganise from a monolithic discrete section to cont-vs-disc side-by-side
per seed. Add §2 simulation environment (physical params, PPO hyperparameters
from hybrid_cv01.yaml). Key finding: Discrete(3) reaches machine-epsilon
position (1e-14 to 1e-16 m) vs ~1 cm for Box(-1,1); settle range nearly
identical across seeds, suggesting a physics-determined bang-bang sequence.
Update and add figures for s42 and s5775; remove sig_t_min figures.
Five time-series figures, each with six panels (position, velocity, θ, θ_dot, acc, reward),
for seed 42: the disc 1.0 m/s baseline, cont vs disc at 5.0 m/s, and the
non-converging cont failure at 9.0 m/s resolved by disc at the same speed.
…cripts

callbacks.py: guard ep_info_buffer None check before len() (mypy arg-type).
ppo_agent.py: extend type: ignore to cover call-arg on render(save_path=).
play_ppo.py: extract _log_sweep_table to reduce main() statement count
  (PLR0915); zip strict=True; ax.grid(visible=True).
plot_sweep.py: zip strict=True; ax.grid(visible=True); removesuffix();
  _ZERO_SPEED_THRESHOLD constant; docstring on main().
plot_training.py: zip strict=True; ax.grid(visible=True); removesuffix().
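
The None guard in callbacks.py follows a common pattern for optional buffers. The sketch below is not the callback itself: the function name is hypothetical, and the buffer is assumed to hold dicts with an `"r"` (episode reward) key, as produced by the Stable-Baselines3 `Monitor` wrapper.

```python
from collections import deque
from typing import Optional


def mean_episode_reward(ep_info_buffer: Optional[deque]) -> Optional[float]:
    """Mean reward over buffered episodes, or None if nothing is available."""
    # Guard first: len(None) raises TypeError, and mypy reports it as arg-type.
    if ep_info_buffer is None or len(ep_info_buffer) == 0:
        return None
    # ASSUMED entry layout: {"r": episode_reward, ...}
    return sum(info["r"] for info in ep_info_buffer) / len(ep_info_buffer)
```

After the `is None` check, mypy narrows the type to `deque`, so the `len()` call and the iteration type-check without an ignore comment.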

@eisDNV eisDNV left a comment

Good work. We start from here and align other branches with the result in the next step.

@eisDNV eisDNV merged commit b2835cb into main Jun 16, 2026
10 checks passed
@eisDNV eisDNV deleted the feat/t-min-reward branch June 16, 2026 08:15

2 participants