Releases: CodeAlive-AI/ai-driven-development
v3.6.2
Patch release correcting Codex hook semantics.
- Makes Codex bash-guard prompt deferral opt-in instead of default.
- Keeps fail-closed deny as the default Codex live behavior.
- Updates hooks-management guidance for Codex App/CLI after upstream research.
Related bash-guard release: bash-guard-v0.3.2
v3.6.1 — Codex prompt rules bridge
Patch release for Codex prompt semantics in bash-guard.
- Keeps Codex bash-guard live by default instead of putting the whole hook into shadow mode.
- Adds explicit defer-to-execpolicy support for selected bash-guard reason codes.
- Installers pair the `supabase.db_push` defer with a managed Codex `prefix_rule(... decision="prompt")`.
- Documents the safe pattern for adapting Claude Code `ask` hooks to Codex.
Verified with Go tests, shell syntax checks, isolated installer tests, direct hook smoke tests, and local codex exec behavior.
v3.6.0 — Codex App hooks support
Updates hooks-management and balanced-safety-hooks for current Codex CLI / Codex App hook behavior.
Highlights
- Updated `hooks-management` for the June 2026 Codex docs: hooks enabled by default, canonical `[features].hooks`, 10 lifecycle events, shared CLI/App config layers, `/hooks` review and trust flow.
- Added Codex App / Codex CLI install path for bash-guard via `~/.codex/hooks.json`.
- Documented the important Codex limitation: `PreToolUse` does not support `permissionDecision: "ask"`; Codex live mode hard-blocks risky Bash commands via `deny`.
- Preserved Claude Code behavior: risky Bash commands still emit `permissionDecision: "ask"`.
- Expanded git safety fixtures for branch creation and stash-sweeping commands.
Verification
- `go test -count=1 ./...` in `hooks/balanced-safety-hooks/src`
- `bash -n` for both installers
- `git diff --check`
- Isolated installer install/uninstall smoke test for `--both`
See also the binary release: bash-guard-v0.3.0.
bash-guard v0.3.2
Patch release for Codex CLI/App safety semantics.
- Codex live mode now defaults to fail-closed deny for bash-guard ask decisions.
- Added --codex-native-prompts as explicit opt-in for best-effort native Codex execpolicy prompts.
- Documented the Codex PreToolUse ask limitation and open runtime risk with prompt rules under full access.
Verification:
- go test -count=1 ./...
- bash -n install.sh && bash -n install-prebuilt.sh && git diff --check
- installer smoke test with and without --codex-native-prompts
- local codex exec fake supabase db push blocked by PreToolUse hook
bash-guard v0.3.1 — Codex prompt rules bridge
Patch release for Codex App / CLI semantics.
What changed
- Keeps Codex bash-guard live by default: risky commands still hard-block through PreToolUse.
- Adds `BASH_GUARD_CODEX_DEFER_REASON_CODES` for a small explicit set of bash-guard reason codes that should be handled by Codex execpolicy prompts.
- Installers configure `supabase.db_push` as the first deferred reason code and add a paired `prefix_rule(... decision="prompt")` to `~/.codex/rules/default.rules`.
- This preserves Claude Code-style semantics: everything is allowed except hook-described risky actions; where Codex can show a native prompt, the hook defers to that prompt, otherwise it blocks.
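The defer rule above can be sketched as follows. This is a minimal illustration, not bash-guard's actual code (which is covered by the Go tests): it assumes `BASH_GUARD_CODEX_DEFER_REASON_CODES` holds a comma-separated list, and the helper names `parse_defer_codes` and `route` are hypothetical.

```python
import os


def parse_defer_codes(raw: str) -> frozenset[str]:
    """Split a comma-separated list of reason codes, ignoring blanks.

    Assumption: the variable is comma-separated; the release notes do not
    state the exact format.
    """
    return frozenset(code.strip() for code in raw.split(",") if code.strip())


def route(reason_code: str, defer_codes: frozenset[str]) -> str:
    """Return 'defer' when Codex execpolicy should prompt, otherwise 'deny'."""
    return "defer" if reason_code in defer_codes else "deny"


# Read the explicit defer set from the environment (empty set when unset).
defer_codes = parse_defer_codes(
    os.environ.get("BASH_GUARD_CODEX_DEFER_REASON_CODES", "")
)
```

With `supabase.db_push` in the set, `route("supabase.db_push", defer_codes)` yields `"defer"`; any reason code outside the set still hard-blocks.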
Verified
- `go test -count=1 ./...`
- shell syntax checks for both installers
- isolated Codex install/uninstall smoke test
- local `codex exec` with a fake Supabase binary: `supabase db push` reaches the execpolicy prompt; `rm -rf /etc/...` is blocked by the PreToolUse hook.
After installing/updating in Codex App, restart the app and review/trust the modified hook in /hooks.
bash-guard v0.3.0 — Codex App support
Adds Codex CLI / Codex App support to bash-guard while preserving Claude Code behavior.
What's new
- New `BASH_GUARD_ADAPTER=codex` wire adapter. Internal `ask` decisions map to Codex `permissionDecision: "deny"` because Codex PreToolUse does not support hook-created `ask` prompts yet.
- Source and prebuilt installers now support `--codex` and `--both`, writing `~/.codex/hooks.json` with `PreToolUse[matcher=^Bash$]`.
- Codex allow path emits empty stdout, matching Codex hook semantics.
- Project config discovery now accepts `.codex/bash-guard.toml` alongside `.claude/bash-guard.toml`, still gated by trusted-projects.
- Audit log records adapter and emitted decision.
- Git guard coverage now asks before `git stash` / `stash push` / `stash save` and new branch creation via `checkout -b/-B` or `switch -c/-C`.
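The adapter mapping described above can be illustrated with a short sketch. It is not the shipped code: the function name `render` is hypothetical, the JSON shape is modelled on the Claude Code PreToolUse hook output and may differ from what bash-guard emits, and treating the allow path as empty stdout for both adapters is a simplification (the notes only state it for Codex).

```python
import json


def render(adapter: str, decision: str, reason: str) -> str:
    """Return the stdout payload for a hook decision ('' means print nothing)."""
    if decision == "allow":
        # Allow path: emit nothing (stated for Codex; assumed here for Claude).
        return ""
    if adapter == "codex" and decision == "ask":
        # Codex PreToolUse cannot raise an ask prompt, so fail closed.
        decision = "deny"
    payload = {
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": decision,
            "permissionDecisionReason": reason,
        }
    }
    return json.dumps(payload)
```

The same internal `ask` therefore surfaces as a prompt under Claude Code and as a hard block under Codex.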
Verification
- `go test -count=1 ./...`
- `bash -n install.sh install-prebuilt.sh`
- Isolated installer install/uninstall test for `--both`
- Manual Codex wire smoke test: risky Bash emits `deny`; safe Bash emits no stdout.
Quick install
# Claude Code
curl -fsSL https://raw.githubusercontent.com/CodeAlive-AI/ai-driven-development/main/hooks/balanced-safety-hooks/install-prebuilt.sh | sh
# Codex CLI / Codex App
curl -fsSL https://raw.githubusercontent.com/CodeAlive-AI/ai-driven-development/main/hooks/balanced-safety-hooks/install-prebuilt.sh | sh -s -- --codex
After installing for Codex, restart Codex and open /hooks to review/trust the new hook if prompted.
v3.5.0 — Rich grading, multi-run variance, blind compare, HTML viewer
Summary
Five additions to the SkillOpt loop, adapted from upstream anthropics/skills' skill-creator eval infrastructure but reframed for management/optimisation (not creation). Addresses a gap identified in the v3.4 retrospective: the validation gate uses the same verifier that proposed edits, which can be self-confirming.
New in optimize_skill.py
| Flag | What it does |
|---|---|
| `--runs-per-task N` | Each task executed N times. `rollouts.jsonl` gains `score_mean`/`score_stddev`/`score_min`/`score_max`/`runs[]`. Validation gate uses the mean. Use when the verifier or agent is noisy. |
| `--verifier assertions` | New grading mode. Tasks gain `assertions[]` (declarative checks). Grader returns rich `grading.json` with per-assertion pass/fail + evidence, extracted claims, and `eval_feedback` — a critique of the assertions themselves that flags weak / non-discriminating checks. |
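As a rough illustration of the multi-run fields, the per-task summary could be computed as below. This is a sketch under assumptions: the helper name `summarise_runs` is hypothetical, `runs[]` is simplified to a list of scores, and population standard deviation is used so that a single run yields 0.0 (the notes do not say which estimator `optimize_skill.py` uses).

```python
import statistics


def summarise_runs(scores: list[float]) -> dict:
    """Collapse N scores for one task into the multi-run summary fields."""
    if not scores:
        raise ValueError("at least one run is required")
    return {
        "score_mean": statistics.fmean(scores),
        # Population stddev: defined for N=1, where it is 0.0.
        "score_stddev": statistics.pstdev(scores),
        "score_min": min(scores),
        "score_max": max(scores),
        "runs": list(scores),
    }
```

A gate that compares `score_mean` values is less sensitive to one lucky or unlucky rollout than a gate that compares single scores.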
`optimization_report.md` gains an Assertion critique section aggregating `eval_feedback.suggestions` across the run, deduped + ranked by frequency. Operator gets back actionable improvements to the eval set.
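The aggregation step might look like the following sketch. It assumes each `grading.json` is loaded as a dict with `eval_feedback.suggestions` as a list of strings, and that deduplication means case- and whitespace-insensitive matching counted once per grading; both are assumptions, and `rank_suggestions` is a hypothetical name.

```python
from collections import Counter


def rank_suggestions(gradings: list[dict]) -> list[tuple[str, int]]:
    """Dedupe suggestions across gradings and rank them by frequency."""
    counts: Counter = Counter()
    for grading in gradings:
        suggestions = grading.get("eval_feedback", {}).get("suggestions", [])
        # Normalise case and whitespace; count each suggestion once per grading.
        normalised = {" ".join(s.lower().split()) for s in suggestions}
        counts.update(normalised)
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```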
New scripts
| Script | Purpose |
|---|---|
| `scripts/blind_comparator.py` | Independent A/B judge between two skills on the same tasks. Randomised X/Y labels per task with a seed. Aggregated to `comparison_report.{json,md}`. Catches self-confirming gate behaviour. |
| `scripts/eval_viewer.py` | Single-page static HTML renderer for an output-dir. Per-epoch SVG chart, accepted/rejected edit timelines, slow-update history, per-task rollouts with grading checklists, initial→best diff. `--compare` mode for two runs side-by-side. No JS / CSS dependencies. |
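The seeded X/Y labelling can be sketched like this. It is an illustration only: `assign_labels` and `unblind` are hypothetical names, and the real `blind_comparator.py` may derive its per-task randomness differently.

```python
import random


def assign_labels(task_ids: list[str], seed: int) -> dict[str, dict[str, str]]:
    """Randomly map skills 'a' and 'b' to blind labels 'X' and 'Y' per task."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    labels = {}
    for task_id in task_ids:
        pair = ["a", "b"]
        rng.shuffle(pair)
        labels[task_id] = {"X": pair[0], "Y": pair[1]}
    return labels


def unblind(labels: dict[str, dict[str, str]], task_id: str, verdict: str) -> str:
    """Translate a judge verdict ('X' or 'Y') back to the skill it refers to."""
    return labels[task_id][verdict]
```

Because the judge only ever sees X and Y, it cannot systematically favour the skill the optimiser produced; the mapping is applied after the verdicts are in.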
New prompt contracts (in `prompts/`)
- `grader.md` — assertions verifier grader
- `blind_comparator.md` — blind A/B judge
Reference updates
- `optimization-artifacts-schemas.md` (+193 lines): `tasks.jsonl` with `assertions[]`/`files[]`, `rollouts.jsonl` with multi-run + grading fields, `decision.json` with stddev, new blind comparator artefacts, eval viewer artefact. All v3.5 changes are additive; schema stability is a v3.x guarantee.
- `optimization-grading-checklist.md` (+63 lines): pre-flight stddev check, per-task variance review, assertion critique review, blind comparison verdict review. New red flag: "blind comparator says a_wins ≥ b_wins despite SkillOpt accepting edits — investigate before shipping."
Compatibility
- 100% backward compatible. Default `--runs-per-task=1` + `--verifier llm-judge` produces byte-equivalent output to v3.4.
- Python 3.10+, stdlib only. No new dependencies.
Test plan
- `python3 scripts/optimize_skill.py --help` shows `--runs-per-task` and `assertions` choice
- Run a small task set with `--verifier assertions --runs-per-task 3`, confirm `grading` field in `rollouts.jsonl` and `score_stddev` in `decision.json`
- Confirm `optimization_report.md` contains an "Assertion critique" section
- Run `blind_comparator.py --skill-a initial_skill.md --skill-b best_skill.md --tasks tasks.jsonl --dry-run`
- Run `eval_viewer.py runs/r1` and open the produced HTML in a browser
- Run `eval_viewer.py runs/r1 runs/r2 --compare` and confirm side-by-side layout
🤖 Generated with Claude Code
v3.4.0 — Artefact schemas, run aggregator, grading checklist
Summary
Three additions to skills-management, adopted and adapted from upstream anthropics/skills' skill-creator (commit `690f15c`, May 2026). The upstream is creator-focused; these adaptations target the management / optimisation context.
What's new
| File | Purpose |
|---|---|
| `references/optimization-artifacts-schemas.md` | Formal JSON schemas for every artefact written by `optimize_skill.py` and `log_skill_edit.py`: `splits.json`, `state.json`, `rollouts.jsonl`, `proposals.json`, `decision.json`, `edit_apply_report.json`, `rejected_buffer.json`, `meta_skill.json`, `optimization_report.md` frontmatter, `test_rollouts.jsonl`, `.skill_edit_log.jsonl`, `.skill_snapshots/`. Schema stability is a v3.x guarantee; breaking changes bump major. |
| `scripts/aggregate_runs.py` | Aggregate N optimisation runs into a single summary. Computes mean/stddev/min/max for `test_score`, `tokens_delta`, `test_delta_vs_baseline`, plus per-run lists for accepted/rejected edits. `--compare` for side-by-side; text / json / md output; robust to incomplete runs. |
| `references/optimization-grading-checklist.md` | Audit checklist applied to a finished optimisation run before shipping `best_skill.md`. Pre-flight + per-artefact review (best_skill, edit_apply_report, rejected_buffer, optimization_report+test_rollouts, meta_skill) + red flags + green-light decision paths. |
What we deliberately skipped from upstream
These are creator-only machinery for description prose iteration, not a fit for managing/auditing existing skills:
- `agents/grader.md`, `agents/analyzer.md`, `agents/comparator.md` — eval-loop orchestration
- `scripts/run_loop.py`, `scripts/improve_description.py`, `scripts/run_eval.py` — description-iteration loop
- `eval-viewer/` — creator-side visualisation
SKILL.md changes
- Quick Reference adds the aggregator
- References table adds the two new docs
Compatibility
- Backwards compatible. All v3.3.0 scripts and schemas unchanged.
- Python 3.10+, stdlib only. No new dependencies.
Test plan
- `python3 scripts/aggregate_runs.py --help` exits 0
- Run `optimize_skill.py` once, then `aggregate_runs.py ` and confirm summary
- Run `optimize_skill.py` twice with different seeds, then `aggregate_runs.py r1 r2 --compare`
- Read `references/optimization-grading-checklist.md` end-to-end against a real run
v3.3.0 — SkillOpt training loop for skills-management
Summary
skills-management becomes a trainable skill manager, not just a CRUD tool. Based on SkillOpt (Microsoft, May 2026): treat the SKILL.md document as the external trainable state of a frozen agent, with the same discipline that makes weight-space optimisation reproducible — bounded edits, held-out validation gate, rejected-edit buffer, epoch-wise slow/meta update.
New scripts
| Script | Purpose |
|---|---|
| `scripts/optimize_skill.py` | Full SkillOpt loop: train/sel/test splits, rollout via `claude -p`, failure/success mini-batch reflection, hierarchical merge, ranked bounded apply (constant/linear/cosine L_t schedules), strict-greater validation gate, rejected-edit buffer, epoch-boundary slow update into a protected section, optimiser-side meta-skill. Supports `--dry-run` and `--resume`. |
| `scripts/log_skill_edit.py` | Append-only audit log with SHA chain, token delta, optional `--snapshot`. |
| `scripts/diff_skill_versions.py` | Diff between git commits, explicit files, or snapshot history; unified/stats/side-by-side formats. |
| `scripts/trigger_test.py` | Trigger tests with heuristic or claude-cli judge, P/R/F1 metrics. `--generate` seeds `cases.yaml` from the description. |
| `scripts/transfer_test.py` | Structural verification of a skill across all 42 supported agents; `--copy --to <agent>`, `--all`. |
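The strict-greater validation gate and rejected-edit buffer named for `optimize_skill.py` can be pictured with a small sketch. The names `GateState` and `apply_gate` are hypothetical and the state is reduced to the essentials; the actual script's data structures are described only by its artefact schemas.

```python
from dataclasses import dataclass, field


@dataclass
class GateState:
    best_score: float
    best_skill: str
    rejected: list[str] = field(default_factory=list)


def apply_gate(state: GateState, candidate_skill: str,
               candidate_score: float, edit_summary: str) -> bool:
    """Accept a candidate only if it strictly beats the best held-out score."""
    if candidate_score > state.best_score:  # strict: a tie is a rejection
        state.best_score = candidate_score
        state.best_skill = candidate_skill
        return True
    # Remember the rejected edit so later proposals can avoid repeating it.
    state.rejected.append(edit_summary)
    return False
```

Rejecting ties keeps the skill from drifting through edits that add tokens without a measured gain.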
New prompt contracts (verbatim §C.2 of the paper)
prompts/analyst_error.md, analyst_success.md, merge_failure.md, merge_success.md, merge_final.md, ranking.md, slow_update.md, meta_skill.md.
New reference
references/skill-optimization.md — when to optimise, five core principles, targets (300-2000 tokens, 1-4 accepted edits), one-page algorithm, hyperparameter defaults, six anti-patterns, transfer evidence.
review_skill.py upgrades
- Token footprint (300-2000 target per Table 6 of the paper; penalties at 2000/4000)
- Procedurality check (instance-specific markers — filenames, literal numbers, task references — should be rare)
- Patch-friendliness (anchor density + duplicate-anchor detection for reliable `insert_after` edits)
- Slow-update section integrity (balanced `<!-- SLOW_UPDATE_START -->` / `<!-- SLOW_UPDATE_END -->` markers)
JSON output gains body_tokens, references_tokens, slow_update_tokens, total_tokens, anchor_density, heading_count. CLI flags and exit codes unchanged.
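A marker-balance check of the kind listed under slow-update section integrity could be sketched as follows. This is not `review_skill.py`'s implementation; `slow_update_balanced` is a hypothetical name, and the sketch assumes protected sections may not nest.

```python
import re

MARKER = re.compile(r"<!-- SLOW_UPDATE_(START|END) -->")


def slow_update_balanced(text: str) -> bool:
    """Return True when every START marker is closed by a later END marker."""
    inside = False
    for match in MARKER.finditer(text):
        if match.group(1) == "START":
            if inside:
                return False  # nested START (assumed invalid)
            inside = True
        else:
            if not inside:
                return False  # END without a preceding START
            inside = False
    return not inside  # a dangling START is unbalanced
```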
SKILL.md changes
- New section: Optimize a Skill (SkillOpt-style)
- Quick Reference gains 8 new commands (optimize, log, diff, trigger-test, transfer-test, generate cases)
- Documented `<!-- SLOW_UPDATE_START/END -->` protected-section convention
- Description extended with new trigger phrases: "optimise skill", "train skill on tasks", "iterate skill", "audit skill edits", "log skill edit", "diff skill versions", "trigger test skill", "transfer skill across agents"
Patterns reference
06-patterns-and-troubleshooting.md gains Pattern 6: Validated iterative refinement with the blind-rewrites anti-pattern.
Compatibility
- Backwards compatible. All existing scripts unchanged. `review_skill.py` keeps every existing rule.
- Python 3.10+, stdlib only. No new dependencies.
- `optimize_skill.py` shells out to `claude -p` for rollouts and optimiser calls, so it inherits whatever subscription/API the user has configured.
Test plan
- Run `python3 scripts/review_skill.py <any-skill>` and verify the new Token footprint block appears.
- Run `python3 scripts/optimize_skill.py <skill> --tasks tasks.jsonl --dry-run` and verify schedule + splits + prompt previews print without LLM calls.
- Run `python3 scripts/log_skill_edit.py <skill> --reason "test" --dry-run` and verify the planned entry.
- Run `python3 scripts/trigger_test.py <skill> --generate > cases.yaml` then `--cases cases.yaml`.
- Run `python3 scripts/transfer_test.py <skill> --all` to verify cross-agent placement.
refactoring-csharp v0.1.0
Roslyn-based C# rename refactorer packaged as an agent skill.
Assets include release installers, the skill archive, self-contained CLI binaries for macOS/Linux/Windows, and SHA256 checksums.
Quick install:
# macOS/Linux
curl -fsSL https://github.com/CodeAlive-AI/ai-driven-development/releases/download/refactoring-csharp-v0.1.0/install.sh | bash
# Windows PowerShell
irm https://github.com/CodeAlive-AI/ai-driven-development/releases/download/refactoring-csharp-v0.1.0/install.ps1 | iex