dnv-opensource · eisDNV · Jun 16, 2026 · Jun 9, 2026 · Jun 9, 2026 · Jun 10, 2026
diff --git a/.gitignore b/.gitignore
@@ -158,3 +158,21 @@ demos/**/*.nc
 
 # Cmake generated files
 CMakeUserPresets.json
+
+# Model artifacts (large binaries and generated plots — not tracked by default)
+models/*.zip
+models/*.pkl
+models/*.png
+models/*_meta.json
+models/*_vecnorm.pkl
+
+# Pre-trained hybrid_cv01 models — explicitly tracked
+!models/ppo_hcv01_*.zip
+!models/ppo_hcv01_*_vecnorm.pkl
+!models/ppo_hcv01_*_meta.json
+
+# Q-agent test artifacts
+tests/test_working_directory/*.json
+
+# Local dev docs (not for the repo)
+for_sig.md
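Ordering matters in the rules added above: Git lets the last matching pattern decide, so the `!models/ppo_hcv01_*` negations only work because they follow the broad `models/*` ignores. The sketch below illustrates that last-match-wins logic in Python. It is an approximation, not Git's implementation: `fnmatchcase` stands in for gitignore globbing (an assumption that holds only for simple single-directory patterns like these), and the file names in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase

# Model-artifact rules copied from the .gitignore hunk above, in file order.
RULES = [
    "models/*.zip",
    "models/*.pkl",
    "models/*.png",
    "models/*_meta.json",
    "models/*_vecnorm.pkl",
    "!models/ppo_hcv01_*.zip",
    "!models/ppo_hcv01_*_vecnorm.pkl",
    "!models/ppo_hcv01_*_meta.json",
]


def is_ignored(path: str, rules: list[str] = RULES) -> bool:
    """Return True if `path` ends up ignored; the last matching rule wins."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(path, pattern):
            # A plain rule ignores the path, a '!' rule re-includes it.
            ignored = not negated
    return ignored
```

With these rules, `is_ignored("models/ppo_hcv01_s42.zip")` is `False` while `is_ignored("models/scratch.zip")` is `True`. A file named `models/hybrid_cv01_s42.zip` would also stay ignored unless force-added, because only the `ppo_hcv01_` prefix is re-included.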
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,8 +6,49 @@ The changelog format is based on [Keep a Changelog](https://keepachangelog.com/e
 ## [Unreleased]
 
 ### Added
+* Five new `RewardConfig` fields for a principled derivatives-based reward design:
+  `angle` (-theta^2), `angular_velocity` (-theta_dot^2), `crane_velocity` (-x_dot^2),
+  `crane_acceleration` (-x_ddot^2), `angular_acceleration` (-theta_ddot^2). All default to 0.0
+  for full backward compatibility with existing configs using `energy`.
+  Angular velocity uses pure theta_dot = `(cm_v[0] - origin_v[0]) / wire.length`,
+  excluding crane translation. Angular acceleration is computed via one-step
+  finite difference of theta_dot; zero on the first step after each episode reset.
+* `AntiPendulumEnv` continuous observation `obs[3]` changed from absolute load
+  x-velocity (`wire.cm_v[0]`) to pure angular velocity theta_dot (rad/s), making the
+  observation independent of crane translation velocity.
+* `experiments/derivatives_baseline.yaml`: starting config for the derivatives reward.
+* `experiments/hybrid_cv01.yaml`: validated hybrid config (energy + crane_velocity + position
+  return). Seeds 2718, 3141, and 31415 achieve 6/6 out-of-distribution (OOD) generalisation
+  at `start_speed=7.0`.
+* `start_speed` field in `TrainingConfig` (default 1.0); wired through `train_ppo.py`
+  (`--start-speed`) and `play_ppo.py` (`--start-speed`). With `randomize_start=True`, it acts
+  as the upper bound of the per-episode speed sampling range `+-[min_speed, start_speed]`.
+* `randomize_start` field in `TrainingConfig`; wired through both scripts.
+  `play_ppo.py` pre-parses `--model-path` to auto-load `randomize_start` from the model
+  sidecar; `--randomize-start` / `--no-randomize-start` override it.
+* `RewardConfig`, `TrainingConfig`, and `ExperimentConfig` frozen dataclasses in new module
+  `src/crane_controller/experiment_config.py`. Replace the opaque `reward_fac` tuple with
+  named fields, eliminating the silent index-swap bug class.
+* YAML experiment config support in `train_ppo.py` via `--config PATH`.
+  Missing YAML keys fall back to `RewardConfig`/`TrainingConfig` defaults.
+* `--reward-fac ENERGY POSITIONAL TIME POSITION ACCELERATION` CLI override on `train_ppo.py`;
+  takes precedence over `--config`.
+* JSON sidecar (`*_meta.json`) written alongside every saved model by `train_ppo.py` and read
+  automatically by `play_ppo.py` — reward weights follow the model without manual flags.
+* `terminal_penalty` field in `RewardConfig`: one-time reward added on episode truncation
+  (out-of-bounds crash, i.e. `|x| > rail_limit`). Defaults to 0.0 (disabled).
+  Used in `hybrid_cv01.yaml` as -5.0.
+* `seed`, `ent_coef`, `learning_rate`, `clip_range`, `n_steps` parameters on
+  `ProximalPolicyOptimizationAgent.__init__` and corresponding CLI flags in `train_ppo.py`.
 * `gamma` parameter on `ProximalPolicyOptimizationAgent` (default 0.99) and `--gamma` CLI flag in
   `train_ppo.py` to configure the PPO discount factor without editing source code.
+* `continuous_actions: bool` parameter on `AntiPendulumEnv` (default `False`). When `True`, the
+  action space is `Box([-1], [1])` and the action value is scaled by `acc` to produce crane
+  acceleration, enabling PPO to produce any acceleration in `[-acc, +acc]`. When `False` (default),
+  the action space remains `Discrete(3)` for full Q-agent backward compatibility.
+  `TrainingConfig.continuous_actions` (default `True`) and `--continuous-actions` /
+  `--no-continuous-actions` CLI flags in both `train_ppo.py` and `play_ppo.py` control this for
+  PPO workflows; Q-agent workflows pass `continuous_actions=False` explicitly.
+  `ppo_agent.do_one_episode()` updated to pass actions without casting to `int`, so both action
+  space types work correctly during inference.
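The derivative terms described in the first entry of this list can be sketched as follows. This is a minimal illustration, not the project's code: the helper names (`angular_velocity`, `AngularAccelerationEstimator`, `derivatives_reward`), the `dt` argument, and the weighted-sum form of the reward are assumptions. Only the theta_dot formula, the negative-square terms, and the zero-on-first-step behaviour come from the changelog text.

```python
from __future__ import annotations


def angular_velocity(cm_vx: float, origin_vx: float, wire_length: float) -> float:
    """Pure theta_dot (rad/s): load velocity relative to the crane over wire length."""
    return (cm_vx - origin_vx) / wire_length


class AngularAccelerationEstimator:
    """One-step finite difference of theta_dot; 0.0 on the first step after reset()."""

    def __init__(self, dt: float) -> None:
        self.dt = dt  # simulation step size in seconds (assumed to be constant)
        self._previous: float | None = None

    def reset(self) -> None:
        """Forget the previous sample, e.g. at the start of an episode."""
        self._previous = None

    def update(self, theta_dot: float) -> float:
        if self._previous is None:
            theta_ddot = 0.0  # no history yet, so report zero acceleration
        else:
            theta_ddot = (theta_dot - self._previous) / self.dt
        self._previous = theta_dot
        return theta_ddot


def derivatives_reward(
    weights: dict[str, float],
    theta: float,
    theta_dot: float,
    x_dot: float,
    x_ddot: float,
    theta_ddot: float,
) -> float:
    """Weighted sum of the five negative-square terms; missing weights count as 0.0."""
    terms = {
        "angle": -(theta**2),
        "angular_velocity": -(theta_dot**2),
        "crane_velocity": -(x_dot**2),
        "crane_acceleration": -(x_ddot**2),
        "angular_acceleration": -(theta_ddot**2),
    }
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())
```

Subtracting `origin_v[0]` is what removes crane translation from the signal: a load that moves with the crane at the same horizontal speed yields a theta_dot of zero.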
 
 ### Fixed
 * `ProximalPolicyOptimizationAgent.load()` now applies a `TimeLimit` wrapper (max 3000 steps),
@@ -36,6 +77,12 @@ The changelog format is based on [Keep a Changelog](https://keepachangelog.com/e
   vs training step as a PNG alongside the model after each training run.
 
 ### Changed
+* `AntiPendulumEnv` parameter `size` renamed to `rail_limit`; `TrainingConfig.size` renamed to
+  `rail_limit`; `--size` CLI flag renamed to `--rail-limit`. Semantics unchanged: half-span of
+  the crane rail in metres (crane spans +-rail_limit).
+* `show_plot()` rewritten with 6 individual subplots (load angle, load speed + damping curve,
+  crane position + origin line, crane speed, rewards, x-acceleration), replacing the previous
+  `twinx()`-based layout that caused overlapping scales and colliding legends.
 * Moved `logging.basicConfig` to the top of `main()` in `train_ppo.py` and `play_ppo.py` so
   logging is configured before any application logic runs.
 * Refactored `ProximalPolicyOptimizationAgent` API to separate training and inference concerns:

diff --git a/README.rst b/README.rst
@@ -19,8 +19,31 @@ Environments
     (``crane-controller`` library). The agent controls horizontal crane acceleration and must either
     start or stop the pendulum motion.
 
-    - **Observation**: crane x-position, crane x-velocity, load polar angle, load x-velocity
-    - **Actions**: Discrete(3) — accelerate left / coast / accelerate right
+    .. code-block:: text
+
+          -rail_limit                   0                  +rail_limit
+               |                        |                        |
+         ──────┼────────────────────────┼────────────────────────┼──── rail
+                                  ┌─────┴─────┐
+               ← ẍ ────────────── │   crane   │ ────────────── ẍ →
+                                  └─────┬─────┘
+                                        │   obs[0] = x    crane position
+                                        │   obs[1] = ẋ    crane velocity
+                                        │   reward: −|x|, −ẋ²
+                                     L  │
+                                        │╲ θ
+                                        │ ╲
+                                        │  ╲
+                                        │   ●  load
+                                              obs[2] = θ   polar angle from vertical
+                                              obs[3] = θ̇   angular velocity (pure)
+                                              reward: KE + PE
+
+               episode truncated (terminal_penalty) when |x| > rail_limit
+
+    - **Observation**: crane x-position, crane x-velocity, load polar angle, pure angular velocity θ̇ (rad/s)
+    - **Actions**: ``Discrete(3)`` by default (Q-agent compatible) — accelerate left / coast / accelerate right;
+      ``Box([-1], [1])`` when ``continuous_actions=True`` (PPO default) — continuous acceleration command scaled by ``acc``
     - **Modes**: *start* (build pendulum energy) or *stop* (dampen swing)
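The two action spaces listed above map onto crane acceleration roughly as sketched below. The function name, the clipping of out-of-range continuous values, and the discrete ordering (0 = left, 1 = coast, 2 = right) are assumptions for illustration; the source only states that the continuous action is scaled by `acc` to cover `[-acc, +acc]`.

```python
from typing import Sequence, Union


def action_to_acceleration(
    action: Union[int, Sequence[float]],
    acc: float,
    continuous_actions: bool,
) -> float:
    """Translate an agent action into a horizontal crane acceleration."""
    if continuous_actions:
        # Box action: a length-1 array-like holding a value in [-1, 1].
        value = float(action[0])
        value = max(-1.0, min(1.0, value))  # clipping is an assumption
        return value * acc
    # Discrete(3): assumed ordering 0 = left, 1 = coast, 2 = right.
    return (int(action) - 1) * acc
```

For example, `action_to_acceleration([0.5], 2.0, True)` gives `1.0`, and `action_to_acceleration(0, 2.0, False)` gives `-2.0`.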
 
 ``ControlledCraneEnv``
@@ -93,16 +116,48 @@ Tests are suitable for CI/CD — no plot windows are produced.
 Training
 ^^^^^^^^
 
+Experiment configs
+""""""""""""""""""
+
+PPO training is driven by YAML experiment config files in ``experiments/``.
+Each file encodes both reward weights and training hyperparameters:
+
+.. code-block:: yaml
+
+   # experiments/hybrid_cv01.yaml
+   reward:
+     energy: 1.0
+     crane_velocity: 0.1
+     position: 0.02
+     terminal_penalty: -5.0
+   training:
+     steps: 3000000
+     n_envs: 32
+     gamma: 0.99
+     n_steps: 4096
+     rail_limit: 2.0
+     randomize_start: true
+     start_speed: 1.0
+
+Pass a config with ``--config PATH``; any key not present falls back to the dataclass defaults.
+A JSON sidecar (``*_meta.json``) is written alongside every saved model so ``play_ppo.py``
+can reconstruct the environment automatically — no ``--config`` needed at playback time.
+
 **PPO:**
 
 .. code-block:: shell
 
-   uv run python scripts/train_ppo.py
+   uv run python scripts/train_ppo.py --config experiments/hybrid_cv01.yaml \
+       --save-path models/my_model.zip --seed 42
 
 Key options:
 
+- ``--config PATH`` — load a YAML experiment config (reward weights + training hyperparameters)
 - ``--steps N`` — total training timesteps (default: 100 000)
 - ``--n-envs N`` — number of parallel environments (default: 4)
+- ``--seed N`` — RNG seed for reproducibility
+- ``--continuous-actions`` / ``--no-continuous-actions`` — use ``Box([-1], [1])`` or ``Discrete(3)`` action space (default: continuous)
+- ``--randomize-start`` — sample a new initial crane speed each episode; ``--start-speed SPEED`` sets the initial speed, or the upper bound ±SPEED of the sampling range when randomising (default: 1.0)
 - ``--save-path PATH`` — where to write the trained model (default: ``models/ppo_AntiPendulumEnv.zip``)
 - ``--resume-from PATH`` — continue training from a saved checkpoint; preserves VecNormalize statistics and learning rate schedule
 - ``--dry-run`` — run 1 000 steps with a live reward-tracking plot and no model saved
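The fallback behaviour of ``--config`` (missing YAML keys take the dataclass defaults) can be sketched with frozen dataclasses. The classes below are simplified stand-ins, not the real `RewardConfig` and `TrainingConfig`: the field subset, the helper `from_mapping`, the rejection of unknown keys, and any default marked as a placeholder are assumptions. The input dictionary mimics what `yaml.safe_load` would return for a config file.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional, Type, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RewardConfigSketch:
    energy: float = 0.0            # placeholder default
    crane_velocity: float = 0.0
    position: float = 0.0          # placeholder default
    terminal_penalty: float = 0.0


@dataclass(frozen=True)
class TrainingConfigSketch:
    steps: int = 100_000
    n_envs: int = 4
    gamma: float = 0.99
    rail_limit: float = 2.0        # placeholder default
    randomize_start: bool = False  # placeholder default
    start_speed: float = 1.0


def from_mapping(cls: Type[T], data: Optional[Mapping[str, Any]]) -> T:
    """Build a config; keys absent from `data` keep the dataclass defaults."""
    data = dict(data or {})
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(data) - known)
    if unknown:
        # Failing loudly on typos is a design choice of this sketch.
        raise ValueError(f"unknown keys for {cls.__name__}: {unknown}")
    return cls(**data)


# Shape of a parsed YAML file with only some keys present.
raw = {
    "reward": {"energy": 1.0, "crane_velocity": 0.1},
    "training": {"steps": 3_000_000, "n_envs": 32},
}
reward = from_mapping(RewardConfigSketch, raw.get("reward"))
training = from_mapping(TrainingConfigSketch, raw.get("training"))
```

Named fields make a swapped weight a visible typo in the YAML file, and `frozen=True` prevents a config from being mutated after it has been written to the model sidecar.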
@@ -128,23 +183,54 @@ Playing
 
 Run a trained agent visually. Both scripts accept ``--render-mode`` with the following options:
 
-- ``plot`` — 4-panel figure per episode (load angle, crane position/speed, rewards)
+- ``plot`` — 6-panel figure per episode (load angle, load speed, crane position, crane speed, rewards, acceleration)
 - ``play-back`` — animated crane trajectory after each episode
 - ``reward-tracking`` — live reward line plot updating every step
 
+Pre-trained models
+""""""""""""""""""
+
+Four pre-trained PPO models are included in ``models/`` (trained with ``experiments/hybrid_cv01.yaml``,
+3M steps, 32 parallel envs): two action-space variants (Discrete and Box/continuous) across two
+random seeds (42 and 5775). All generalise well beyond the training range across the full ±10 m/s
+speed sweep (see ``docs/source/reward_comparison.md`` for detailed analysis).
+
++------------------------------------------+----------+------+
+| Model                                    | Actions  | Seed |
++==========================================+==========+======+
+| ``hybrid_cv01_disc_s42.zip``             | Discrete | 42   |
++------------------------------------------+----------+------+
+| ``hybrid_cv01_disc_s5775.zip``           | Discrete | 5775 |
++------------------------------------------+----------+------+
+| ``hybrid_cv01_s42.zip``                  | Box      | 42   |
++------------------------------------------+----------+------+
+| ``hybrid_cv01_s5775.zip``                | Box      | 5775 |
++------------------------------------------+----------+------+
+
+Each model bundle requires three files: ``.zip`` (policy), ``_vecnorm.pkl`` (observation
+normalisation statistics), ``_meta.json`` (reward config + flags). The ``play_ppo.py``
+script locates the sidecar files automatically from ``--model-path``.
+
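A small sketch of how the three-file bundle could be resolved from ``--model-path``. The naming scheme (sidecars share the model's stem and add the ``_vecnorm.pkl`` and ``_meta.json`` suffixes) is inferred from the suffixes named above and is an assumption, as are the helper names; `play_ppo.py` may locate the files differently.

```python
from pathlib import Path
from typing import Dict, List


def sidecar_paths(model_path: str) -> Dict[str, Path]:
    """Derive the expected bundle files from the policy .zip path."""
    model = Path(model_path)
    stem = model.stem  # e.g. "hybrid_cv01_s42" for "hybrid_cv01_s42.zip"
    return {
        "policy": model,
        "vecnorm": model.with_name(f"{stem}_vecnorm.pkl"),
        "meta": model.with_name(f"{stem}_meta.json"),
    }


def missing_files(model_path: str) -> List[str]:
    """List the bundle parts that do not exist on disk."""
    return [name for name, path in sidecar_paths(model_path).items() if not path.exists()]
```

Calling `missing_files` before playback gives an early, readable error when a model was copied without its sidecars.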
 **PPO** (default render-mode: ``play-back``):
 
 .. code-block:: shell
 
-   uv run python scripts/play_ppo.py --model-path models/ppo_AntiPendulumEnv.zip
-   uv run python scripts/play_ppo.py --model-path models/ppo_AntiPendulumEnv.zip --render-mode plot --episodes 3
+   uv run python scripts/play_ppo.py --model-path models/hybrid_cv01_disc_s42.zip --episodes 3 --render-mode plot
+   uv run python scripts/play_ppo.py --model-path models/hybrid_cv01_s42.zip --episodes 3 --render-mode plot
+
+OOD evaluation (randomised start speed, 7× training range):
+
+.. code-block:: shell
+
+   uv run python scripts/play_ppo.py --model-path models/hybrid_cv01_disc_s42.zip \
+       --episodes 6 --render-mode plot --randomize-start --start-speed 7.0
 
 **Q-learning** (default render-mode: ``plot``):
 
 .. code-block:: shell
 
-   uv run python scripts/play_q.py --model-path models/q_AntiPendulumEnv.json
-   uv run python scripts/play_q.py --model-path tests/anti-pendulum.json --render-mode play-back --episodes 3
+   uv run python scripts/play_q.py --model-path models/q_trained.json
+   uv run python scripts/play_q.py --model-path models/q_trained.json --render-mode play-back --episodes 3
 
 Analysing
 ^^^^^^^^^

diff --git a/assets/ood_eval_continuous_1.png b/assets/ood_eval_continuous_1.png
diff --git a/assets/ood_eval_continuous_2.png b/assets/ood_eval_continuous_2.png
diff --git a/assets/ood_eval_discrete_1.png b/assets/ood_eval_discrete_1.png
diff --git a/assets/ood_eval_discrete_2.png b/assets/ood_eval_discrete_2.png
diff --git a/docs/source/_static/episode_cont_s42_v5p0.png b/docs/source/_static/episode_cont_s42_v5p0.png
diff --git a/docs/source/_static/episode_cont_s42_v9p0.png b/docs/source/_static/episode_cont_s42_v9p0.png
diff --git a/docs/source/_static/episode_disc_s42_v1p0.png b/docs/source/_static/episode_disc_s42_v1p0.png
diff --git a/docs/source/_static/episode_disc_s42_v5p0.png b/docs/source/_static/episode_disc_s42_v5p0.png
diff --git a/docs/source/_static/episode_disc_s42_v9p0.png b/docs/source/_static/episode_disc_s42_v9p0.png
diff --git a/docs/source/_static/fig1_training_s5775.png b/docs/source/_static/fig1_training_s5775.png
diff --git a/docs/source/_static/fig2_sweep_s5775.png b/docs/source/_static/fig2_sweep_s5775.png
diff --git a/docs/source/_static/fig_sweep_s42.png b/docs/source/_static/fig_sweep_s42.png
diff --git a/docs/source/_static/fig_training_s42.png b/docs/source/_static/fig_training_s42.png
diff --git a/docs/source/_static/hybrid_cv01_disc_s42_detail.png b/docs/source/_static/hybrid_cv01_disc_s42_detail.png
diff --git a/docs/source/_static/hybrid_cv01_disc_s5775_detail.png b/docs/source/_static/hybrid_cv01_disc_s5775_detail.png
diff --git a/docs/source/_static/hybrid_cv01_s42_detail.png b/docs/source/_static/hybrid_cv01_s42_detail.png
diff --git a/docs/source/_static/hybrid_cv01_s5775_detail.png b/docs/source/_static/hybrid_cv01_s5775_detail.png
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -8,6 +8,7 @@ crane-controller Documentation
    :caption: Contents:
 
    README
+   reward_comparison
    assurance
    api
    CHANGELOG