Merged
28 commits
7ebe03a
feat: reward/obs refactor + continuous action space for PPO
aleksandarbabicdnv Jun 9, 2026
b2358e8
feat: add pre-trained hybrid_cv01 PPO models (6 seeds x discrete + co…
aleksandarbabicdnv Jun 9, 2026
7d9f73e
feat: add seed 827596 models + OOD episode figures to README
aleksandarbabicdnv Jun 10, 2026
ace3b13
fix: resolve pyright warnings introduced by this branch
aleksandarbabicdnv Jun 10, 2026
11927a5
docs: add ASCII physics diagram to AntiPendulumEnv description
aleksandarbabicdnv Jun 10, 2026
4a04c63
fix: add call-overload to type: ignore on int(Mapping.get()) calls (m…
aleksandarbabicdnv Jun 10, 2026
4506ec6
feat: add t_min_crane reward, success_threshold classification, and 7…
aleksandarbabicdnv Jun 11, 2026
1232ec9
feat: add EpisodeResult + speed sweep + CSV export to play evaluation
aleksandarbabicdnv Jun 11, 2026
c5b0063
fix: capture t_min_start from reset info; add speed-sweep summary table
aleksandarbabicdnv Jun 11, 2026
2c58f21
feat: add t_min_settle_step to EpisodeResult and sweep summary table
aleksandarbabicdnv Jun 11, 2026
3a2f444
fix: redefine t_min_settle_step as first step crane permanently stays…
aleksandarbabicdnv Jun 11, 2026
2ee9fdf
feat: add no_crash, theta_final, theta_dot_final to EpisodeResult + s…
aleksandarbabicdnv Jun 11, 2026
5370563
feat: add energy_final, acc_final to EpisodeResult; refine sweep table
aleksandarbabicdnv Jun 11, 2026
7352ede
refactor: make sweep table human-readable (x_pos in m, energy_frac, d…
aleksandarbabicdnv Jun 11, 2026
503f7a7
refactor: reshape sweep table — rew/step, energy_frac first, add x_ve…
aleksandarbabicdnv Jun 11, 2026
3daf2cd
refactor: rename sweep table column settle → settle_step
aleksandarbabicdnv Jun 11, 2026
dececeb
chore: remove superseded ppo_hcv01 model files
aleksandarbabicdnv Jun 15, 2026
f66cdbe
refactor: add RewardConfig, t_min reward, update training pipeline
aleksandarbabicdnv Jun 15, 2026
148f10a
feat: seed 5775 analysis -- reward_comparison doc, figures, sweep data
aleksandarbabicdnv Jun 15, 2026
a66bb5c
feat: extend reward comparison to seed 42; rename doc, add seed-sensi…
aleksandarbabicdnv Jun 15, 2026
e5167bd
refactor: extend sweep plot to 6 panels; remove t_min from episode di…
aleksandarbabicdnv Jun 16, 2026
c89730e
feat: add Discrete(3) training results for hybrid_cv01 (seeds 42 and …
aleksandarbabicdnv Jun 16, 2026
55f184d
feat: retrain hybrid_cv01_s42 continuous; remove sig_t_min model results
aleksandarbabicdnv Jun 16, 2026
59018bd
docs: restructure reward_comparison with seed sensitivity and action …
aleksandarbabicdnv Jun 16, 2026
1481f6e
docs: add episode trajectory gallery to reward_comparison (§6.4)
aleksandarbabicdnv Jun 16, 2026
f6ce382
fix: resolve mypy errors and ruff warnings in callbacks, ppo_agent, s…
aleksandarbabicdnv Jun 16, 2026
e3f37a6
style: add blank line after inline import (ruff format)
aleksandarbabicdnv Jun 16, 2026
86913ae
docs: sync README model table; drop stale OOD eval figures
aleksandarbabicdnv Jun 16, 2026
18 changes: 18 additions & 0 deletions .gitignore
@@ -158,3 +158,21 @@ demos/**/*.nc

# Cmake generated files
CMakeUserPresets.json

# Model artifacts (large binaries and generated plots — not tracked by default)
models/*.zip
models/*.pkl
models/*.png
models/*_meta.json
models/*_vecnorm.pkl

# Pre-trained hybrid_cv01 models — explicitly tracked
!models/ppo_hcv01_*.zip
!models/ppo_hcv01_*_vecnorm.pkl
!models/ppo_hcv01_*_meta.json

# Q-agent test artifacts
tests/test_working_directory/*.json

# Local dev docs (not for the repo)
for_sig.md
47 changes: 47 additions & 0 deletions CHANGELOG.md
@@ -6,8 +6,49 @@ The changelog format is based on [Keep a Changelog](https://keepachangelog.com/e
## [Unreleased]

### Added
* Five new `RewardConfig` fields for a principled derivatives-based reward design:
`angle` (-theta^2), `angular_velocity` (-theta_dot^2), `crane_velocity` (-x_dot^2),
`crane_acceleration` (-x_ddot^2), `angular_acceleration` (-theta_ddot^2). All default to 0.0
for full backward compatibility with existing configs using `energy`.
Angular velocity uses pure theta_dot = `(cm_v[0] - origin_v[0]) / wire.length`,
excluding crane translation. Angular acceleration is computed via one-step
finite difference of theta_dot; zero on the first step after each episode reset.
* `AntiPendulumEnv` continuous observation `obs[3]` changed from absolute load
x-velocity (`wire.cm_v[0]`) to pure angular velocity theta_dot (rad/s), making the
observation independent of crane translation velocity.
* `experiments/derivatives_baseline.yaml`: starting config for the derivatives reward.
* `experiments/hybrid_cv01.yaml`: validated hybrid config (energy + crane_velocity + position
return). Seeds 2718, 3141, 31415 achieve 6/6 OOD generalisation at start_speed=7.0.
* `start_speed` field in `TrainingConfig` (default 1.0); wired through `train_ppo.py`
(`--start-speed`) and `play_ppo.py` (`--start-speed`). With `randomize_start=True`, it acts
as the upper bound of the per-episode speed sampling range `±[min_speed, start_speed]`.
* `randomize_start` field in `TrainingConfig`; wired through both scripts.
`play_ppo.py` pre-parses `--model-path` to auto-load `randomize_start` from the model
sidecar; `--randomize-start` / `--no-randomize-start` override it.
* `RewardConfig`, `TrainingConfig`, and `ExperimentConfig` frozen dataclasses in new module
`src/crane_controller/experiment_config.py`. Replace the opaque `reward_fac` tuple with
named fields, eliminating the silent index-swap bug class.
* YAML experiment config support in `train_ppo.py` via `--config PATH`.
Missing YAML keys fall back to `RewardConfig`/`TrainingConfig` defaults.
* `--reward-fac ENERGY POSITIONAL TIME POSITION ACCELERATION` CLI override on `train_ppo.py`;
takes precedence over `--config`.
* JSON sidecar (`*_meta.json`) written alongside every saved model by `train_ppo.py` and read
automatically by `play_ppo.py` — reward weights follow the model without manual flags.
* `terminal_penalty` field in `RewardConfig`: one-time reward added on episode truncation
(out-of-bounds crash, i.e. the crane leaves the rail). Defaults to 0.0 (disabled). Used in `hybrid_cv01.yaml` as -5.0.
* `seed`, `ent_coef`, `learning_rate`, `clip_range`, `n_steps` parameters on
`ProximalPolicyOptimizationAgent.__init__` and corresponding CLI flags in `train_ppo.py`.
* `gamma` parameter on `ProximalPolicyOptimizationAgent` (default 0.99) and `--gamma` CLI flag in
`train_ppo.py` to configure the PPO discount factor without editing source code.
* `continuous_actions: bool` parameter on `AntiPendulumEnv` (default `False`). When `True`, the
action space is `Box([-1], [1])` and the action value is scaled by `acc` to produce crane
acceleration, enabling PPO to produce any acceleration in `[-acc, +acc]`. When `False` (default),
the action space remains `Discrete(3)` for full Q-agent backward compatibility.
`TrainingConfig.continuous_actions` (default `True`) and `--continuous-actions` /
`--no-continuous-actions` CLI flags in both `train_ppo.py` and `play_ppo.py` control this for
PPO workflows; Q-agent workflows pass `continuous_actions=False` explicitly.
`ppo_agent.do_one_episode()` updated to pass actions without casting to `int`, so both action
space types work correctly during inference.
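
The two derivative quantities described above (pure angular velocity and the one-step finite-difference angular acceleration) can be sketched as follows. This is a minimal illustration, not the repository's implementation: the names `pure_theta_dot` and `AngularAccelEstimator` and the explicit `dt` argument are assumptions made for the example. Only the formula `(cm_v[0] - origin_v[0]) / wire.length` and the zero-on-first-step convention come from the entries above.

```python
from dataclasses import dataclass
from typing import Optional


def pure_theta_dot(cm_vx: float, origin_vx: float, wire_length: float) -> float:
    """Angular velocity of the load with the crane's translation removed."""
    return (cm_vx - origin_vx) / wire_length


@dataclass
class AngularAccelEstimator:
    """One-step finite difference of theta_dot (hypothetical helper)."""

    dt: float  # simulation step size in seconds (assumed to be known)
    _prev: Optional[float] = None

    def reset(self) -> None:
        # Call at every episode reset so the first step reports 0.0.
        self._prev = None

    def update(self, theta_dot: float) -> float:
        # No previous sample yet: report zero instead of a spurious spike.
        acc = 0.0 if self._prev is None else (theta_dot - self._prev) / self.dt
        self._prev = theta_dot
        return acc


def derivative_penalties(theta_dot: float, theta_ddot: float) -> tuple[float, float]:
    """Quadratic penalties matching -theta_dot^2 and -theta_ddot^2."""
    return -(theta_dot**2), -(theta_ddot**2)
```

Resetting the estimator together with the environment keeps the first-step acceleration penalty at zero, which avoids penalising the agent for a difference taken across two unrelated episodes.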

### Fixed
* `ProximalPolicyOptimizationAgent.load()` now applies a `TimeLimit` wrapper (max 3000 steps),
@@ -36,6 +77,12 @@ The changelog format is based on [Keep a Changelog](https://keepachangelog.com/e
vs training step as a PNG alongside the model after each training run.

### Changed
* `AntiPendulumEnv` parameter `size` renamed to `rail_limit`; `TrainingConfig.size` renamed to
`rail_limit`; `--size` CLI flag renamed to `--rail-limit`. Semantics unchanged: half-span of
the crane rail in metres (crane spans ±rail_limit).
* `show_plot()` rewritten with 6 individual subplots (load angle, load speed + damping curve,
crane position + origin line, crane speed, rewards, x-acceleration), replacing the previous
`twinx()`-based layout that caused overlapping scales and colliding legends.
* Moved `logging.basicConfig` to the top of `main()` in `train_ppo.py` and `play_ppo.py` so
logging is configured before any application logic runs.
* Refactored `ProximalPolicyOptimizationAgent` API to separate training and inference concerns:
102 changes: 94 additions & 8 deletions README.rst
@@ -19,8 +19,31 @@ Environments
(``crane-controller`` library). The agent controls horizontal crane acceleration and must either
start or stop the pendulum motion.

- **Observation**: crane x-position, crane x-velocity, load polar angle, load x-velocity
- **Actions**: Discrete(3) — accelerate left / coast / accelerate right
.. code-block:: text

 -rail_limit                   0                   +rail_limit
      |                        |                        |
──────┼────────────────────────┼────────────────────────┼──── rail
                         ┌─────┴─────┐
      ← ẍ ────────────── │   crane   │ ────────────── ẍ →
                         └─────┬─────┘
                               │      obs[0] = x   crane position
                               │      obs[1] = ẋ   crane velocity
                               │      reward: −|x|, −ẋ²
                             L │
                               │╲ θ
                               │ ╲
                               │  ╲
                               │   ● load
                                      obs[2] = θ   polar angle from vertical
                                      obs[3] = θ̇   angular velocity (pure)
                                      reward: KE + PE

episode truncated (terminal_penalty) when |x| > rail_limit

- **Observation**: crane x-position, crane x-velocity, load polar angle, pure angular velocity θ̇ (rad/s)
- **Actions**: ``Discrete(3)`` by default (Q-agent compatible) — accelerate left / coast / accelerate right;
``Box([-1, 1])`` when ``continuous_actions=True`` (PPO default) — continuous acceleration command
- **Modes**: *start* (build pendulum energy) or *stop* (dampen swing)
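
How the two action spaces could map to a crane acceleration is sketched below. This is illustrative only: the function name ``action_to_acceleration``, the index ordering 0 = left, 1 = coast, 2 = right, and the defensive clipping are assumptions for the example. The source states only that the ``Box([-1], [1])`` value is scaled by ``acc``.

```python
from typing import Sequence, Union


def action_to_acceleration(
    action: Union[int, Sequence[float]],
    acc: float,
    continuous: bool,
) -> float:
    """Translate an agent action into a crane acceleration in [-acc, +acc]."""
    if continuous:
        # Box([-1], [1]): a one-element vector. Clip defensively (assumption),
        # then scale by the maximum acceleration.
        value = max(-1.0, min(1.0, float(action[0])))
        return value * acc
    # Discrete(3): assumed ordering 0 = left, 1 = coast, 2 = right.
    return (int(action) - 1) * acc


# A continuous command of 0.5 with acc = 2.0 requests half the maximum push.
half_push = action_to_acceleration([0.5], acc=2.0, continuous=True)
```

With this mapping the discrete policy can only pick one of three accelerations, whereas the continuous policy can request any value in between.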

``ControlledCraneEnv``
@@ -93,16 +116,48 @@ Tests are suitable for CI/CD — no plot windows are produced.
Training
^^^^^^^^

Experiment configs
""""""""""""""""""

PPO training is driven by YAML experiment config files in ``experiments/``.
Each file encodes both reward weights and training hyperparameters:

.. code-block:: yaml

# experiments/hybrid_cv01.yaml
reward:
  energy: 1.0
  crane_velocity: 0.1
  position: 0.02
  terminal_penalty: -5.0
training:
  steps: 3000000
  n_envs: 32
  gamma: 0.99
  n_steps: 4096
  rail_limit: 2.0
  randomize_start: true
  start_speed: 1.0

Pass a config with ``--config PATH``; any key not present falls back to the dataclass defaults.
A JSON sidecar (``*_meta.json``) is written alongside every saved model so ``play_ppo.py``
can reconstruct the environment automatically — no ``--config`` needed at playback time.
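
The fallback-to-defaults behaviour can be sketched with a frozen dataclass built from an already-parsed mapping. This is a hypothetical illustration, not the project's ``experiment_config`` module: ``SketchRewardConfig``, ``from_mapping`` and the default values shown are assumptions, and rejecting unknown keys is a design choice of the sketch rather than documented behaviour.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class SketchRewardConfig:
    """Illustrative subset of reward weights; defaults are placeholders."""

    energy: float = 0.0
    crane_velocity: float = 0.0
    position: float = 0.0
    terminal_penalty: float = 0.0


def from_mapping(cls: type, data: Mapping[str, Any]) -> Any:
    """Build a config from parsed YAML; missing keys keep dataclass defaults."""
    known = {f.name for f in fields(cls)}
    unknown = set(data) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring a weight.
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return cls(**data)


# Only two keys are given; the other fields fall back to their defaults.
reward = from_mapping(SketchRewardConfig, {"energy": 1.0, "crane_velocity": 0.1})
```

Named fields on a frozen instance mean a weight cannot be swapped by position or mutated after construction, which is the bug class the named-field design is meant to rule out.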

**PPO:**

.. code-block:: shell

uv run python scripts/train_ppo.py
uv run python scripts/train_ppo.py --config experiments/hybrid_cv01.yaml \
--save-path models/my_model.zip --seed 42

Key options:

- ``--config PATH`` — load a YAML experiment config (reward weights + training hyperparams)
- ``--steps N`` — total training timesteps (default: 100 000)
- ``--n-envs N`` — number of parallel environments (default: 4)
- ``--seed N`` — RNG seed for reproducibility
- ``--continuous-actions`` / ``--no-continuous-actions`` — use ``Box([-1,1])`` or ``Discrete(3)`` action space (default: continuous)
- ``--randomize-start`` / ``--start-speed SPEED`` — randomise initial crane speed up to ±SPEED each episode
- ``--save-path PATH`` — where to write the trained model (default: ``models/ppo_AntiPendulumEnv.zip``)
- ``--resume-from PATH`` — continue training from a saved checkpoint; preserves VecNormalize statistics and learning rate schedule
- ``--dry-run`` — run 1 000 steps with a live reward-tracking plot and no model saved
@@ -128,23 +183,54 @@ Playing

Run a trained agent visually. Both scripts accept ``--render-mode`` with the following options:

- ``plot`` — 4-panel figure per episode (load angle, crane position/speed, rewards)
- ``plot`` — 6-panel figure per episode (load angle, load speed, crane position, crane speed, rewards, acceleration)
- ``play-back`` — animated crane trajectory after each episode
- ``reward-tracking`` — live reward line plot updating every step

Pre-trained models
""""""""""""""""""

Four pre-trained PPO models are included in ``models/`` (trained with ``experiments/hybrid_cv01.yaml``,
3M steps, 32 parallel envs): two action-space variants (Discrete and Box/continuous) across two
random seeds (42 and 5775). All generalise well beyond the training range across the full ±10 m/s
speed sweep (see ``docs/source/reward_comparison.md`` for detailed analysis).

+------------------------------------------+----------+------+
| Model | Actions | Seed |
+==========================================+==========+======+
| ``hybrid_cv01_disc_s42.zip`` | Discrete | 42 |
+------------------------------------------+----------+------+
| ``hybrid_cv01_disc_s5775.zip`` | Discrete | 5775 |
+------------------------------------------+----------+------+
| ``hybrid_cv01_s42.zip`` | Box | 42 |
+------------------------------------------+----------+------+
| ``hybrid_cv01_s5775.zip`` | Box | 5775 |
+------------------------------------------+----------+------+

Each model bundle requires three files: ``.zip`` (policy), ``_vecnorm.pkl`` (observation
normalisation statistics), ``_meta.json`` (reward config + flags). The ``play_ppo.py``
script locates the sidecar files automatically from ``--model-path``.
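
Deriving the two sidecar paths from the policy path can be sketched as follows. The helper names ``bundle_paths`` and ``missing_files`` are hypothetical, and the sketch assumes that both sidecars share the ``.zip`` file's stem, as the file suffixes above suggest. The actual lookup in ``play_ppo.py`` may differ.

```python
from pathlib import Path
from typing import Dict, List, Union


def bundle_paths(model_path: Union[str, Path]) -> Dict[str, Path]:
    """Return the three files of a model bundle, keyed by role."""
    model = Path(model_path)
    stem = model.with_suffix("")  # drop the trailing ".zip"
    return {
        "policy": model,
        "vecnorm": stem.with_name(stem.name + "_vecnorm.pkl"),
        "meta": stem.with_name(stem.name + "_meta.json"),
    }


def missing_files(model_path: Union[str, Path]) -> List[str]:
    """List the bundle roles whose file does not exist on disk."""
    return [role for role, path in bundle_paths(model_path).items() if not path.exists()]


paths = bundle_paths("models/hybrid_cv01_s42.zip")
```

Checking ``missing_files`` before loading gives a clear error when a model has been copied without its normalisation statistics or metadata.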

**PPO** (default render-mode: ``play-back``):

.. code-block:: shell

uv run python scripts/play_ppo.py --model-path models/ppo_AntiPendulumEnv.zip
uv run python scripts/play_ppo.py --model-path models/ppo_AntiPendulumEnv.zip --render-mode plot --episodes 3
uv run python scripts/play_ppo.py --model-path models/hybrid_cv01_disc_s42.zip --episodes 3 --render-mode plot
uv run python scripts/play_ppo.py --model-path models/hybrid_cv01_s42.zip --episodes 3 --render-mode plot

OOD evaluation (randomised start speed, 7× training range):

.. code-block:: shell

uv run python scripts/play_ppo.py --model-path models/hybrid_cv01_disc_s42.zip \
--episodes 6 --render-mode plot --randomize-start --start-speed 7.0

**Q-learning** (default render-mode: ``plot``):

.. code-block:: shell

uv run python scripts/play_q.py --model-path models/q_AntiPendulumEnv.json
uv run python scripts/play_q.py --model-path tests/anti-pendulum.json --render-mode play-back --episodes 3
uv run python scripts/play_q.py --model-path models/q_trained.json
uv run python scripts/play_q.py --model-path models/q_trained.json --render-mode play-back --episodes 3

Analysing
^^^^^^^^^
Binary file added assets/ood_eval_continuous_1.png
Binary file added assets/ood_eval_continuous_2.png
Binary file added assets/ood_eval_discrete_1.png
Binary file added assets/ood_eval_discrete_2.png
Binary file added docs/source/_static/episode_cont_s42_v5p0.png
Binary file added docs/source/_static/episode_cont_s42_v9p0.png
Binary file added docs/source/_static/episode_disc_s42_v1p0.png
Binary file added docs/source/_static/episode_disc_s42_v5p0.png
Binary file added docs/source/_static/episode_disc_s42_v9p0.png
Binary file added docs/source/_static/fig1_training_s5775.png
Binary file added docs/source/_static/fig2_sweep_s5775.png
Binary file added docs/source/_static/fig_sweep_s42.png
Binary file added docs/source/_static/fig_training_s42.png
Binary file added docs/source/_static/hybrid_cv01_s42_detail.png
Binary file added docs/source/_static/hybrid_cv01_s5775_detail.png
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -8,6 +8,7 @@ crane-controller Documentation
:caption: Contents:

README
reward_comparison
assurance
api
CHANGELOG