@@ -0,0 +1,136 @@
---
catalogue_id: F41
title: "Source-surface leakage of codegen-internal primitive — type-suffix name fossilizes in user-facing API"
family: F1-Sediment (design-surface contamination sub-form)
severity: P1
status: ratified_2026-05-20
empirical_project: Cobrust Phase G sprint (2026-05-19/20)
cobrust_local_id: F38 (f38-source-surface-leakage-codegen-primitive.md)
date_ratified: 2026-05-20
cobrust_sha: 46c0946
resolution_adr: ADR-0064 (print-monomorphization-source-surface-cleanup)
constitutional_binding: CLAUDE.md §2.5 (LLM-first design principle, training-data-overlap rule)
---

# F41 — Source-surface leakage of codegen-internal primitive

## Pattern

A codegen-internal primitive — named by type shape (`<verb>_<type>`, e.g.,
`print_int`, `print_str`) — leaks into the source-face PRELUDE during a
demo sprint. It fossilizes when subsequent waves do not audit the question:
"is this name source-face API or codegen-internal symbol?"

The leak path:

1. Demo sprint needs to prove codegen works → quickest route is direct
monomorphic names (`print_int`, `print_str`).
2. Demo lands, wave closes, no cleanup ADR authored.
3. Next wave sees the names in PRELUDE, writes examples against them,
accumulates usage at call sites.
4. By the time an audit catches it, migration cost is non-trivial
(50-100+ call sites across examples, fixtures, skills).

This is not a logic bug. It is a **design-surface contamination bug**: the
internal implementation vocabulary bleeds into the user vocabulary.

## Root cause

Two independent dynamics compound:

- **Sprint-tempo bias**: demo-ware ships the shortest path to visible output.
Monomorphic names (`print_int`) are that shortest path. No gate asks "is
this user-facing?" at demo time.
- **Accumulation drift (F1 Sediment)**: wave-2 onward does not re-examine
whether PRELUDE entries are source-face intentional. Each usage is another
call site, each call site raises the migration cost, which raises the
perceived risk of cleanup, which delays the cleanup further.

## Why this is critical for ADSD / LLM-first projects

Per CLAUDE.md §2.5 (LLM-first design principle, constitutional north star):

> Cobrust is the language LLM agents write correctly on the first try.

The **training-data-overlap rule** is the key binding:

- LLMs trained on Python/Rust write `print(x)` — one of the highest-frequency
call patterns in any Python corpus.
- `print_int(x)` appears in neither Python nor Rust training data. It is a
Cobrust-internal artifact.
- Result: the LLM generates `print(x)`, name resolution fails because only
`print_int` is defined, the LLM is confused by the gap between its prior and the
actual API, and the corrective loop consumes tokens and latency for zero semantic value.

Every type-suffix source-face name is a **friction multiplier on every future
LLM-driven generation session** against the codebase.

## Empirical evidence (Cobrust 2026-05-19/20)

**Affected names (Phase E demo era, Cobrust 2026-04):**

| Source-face name (wrong) | Should be | Internal C-ABI symbol |
|--------------------------|--------------|--------------------------|
| `print_int` | `print` | `__cobrust_print_int` |
| `print_str` | `print` | `__cobrust_print_str` |
| `print_bool` | `print` | `__cobrust_print_bool` |
| `print_float` | `print` | `__cobrust_print_float` |

**Call-site count at cleanup (ADR-0064 sprint):**
- 133 `.cb` call sites + ~200 Rust inline-source test strings refactored.
- Net source delta ~333 LOC across 4 cleanup commits.

**Sprint commit references (Cobrust main):**
- `c73be4e` — PRELUDE table: remove `print_int`/`str`/`bool`/`float` source-face entries
- `b51b907` — polymorphic `print()` dispatch in `synth_call` + codegen monomorphization
- `5e87e77` — mechanical refactor: 133 `.cb` call sites + Rust inline strings → `print()`
- `46c0946` — Phase 4 fix: `Ty::None` callret locals must dispatch to `__cobrust_println_int`
not str-buf (caught by regression during cleanup)

**Ratified at:** commit `46c0946` (feature/0064-print-mono, rebased on main 2026-05-20).

**Post-ratification state:**
- Zero `print_int`/`print_str`/`print_bool`/`print_float` call-sites in any `.cb` file
under `examples/`. Confirmed via `grep -rEn "print_(int|str|bool|float)\(" examples/ --include="*.cb"` → empty.
- LC-100 12/12 maintained (including LC-05 which caught a `Ty::None` dispatch bug exposed by cleanup).
- 5+ integration tests passing for polymorphic `print`.

## Detection rule (CI gate candidate)

For every function listed in the PRELUDE source-face table:

> If the function name matches `<verb>_<type>` where `<type>` ∈
> `{int, str, bool, float, list, dict, set, tuple, ...}`, file an audit issue:
> "should this be polymorphic in source?"

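```python
import re

# Names ending in a type suffix suggest a codegen-internal, monomorphic symbol.
TYPE_SUFFIX = re.compile(r'^[a-z_]+_(int|str|bool|float|list|dict|set|tuple)$')

# PRELUDE and emit_audit_warning are provided by the surrounding lint harness.
for name in PRELUDE.source_face_names:
    if TYPE_SUFFIX.match(name):
        emit_audit_warning(
            f"PRELUDE name '{name}' matches type-suffix pattern; "
            "verify it is source-face intentional, not codegen-internal leakage"
        )
```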

Candidate for a lint pass in CI. False-positive risk is low on a well-curated PRELUDE:
intentional type-suffix names are rare, and any hit deserves a justification comment.
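
The "justification comment" idea can be folded into the lint itself. The following is a minimal, self-contained sketch, assuming a hypothetical `justified` allowlist and made-up PRELUDE contents; neither reflects the actual Cobrust tables.

```python
import re

# Same type-suffix pattern as the detection rule above.
TYPE_SUFFIX = re.compile(r"^[a-z_]+_(int|str|bool|float|list|dict|set|tuple)$")


def audit_prelude(names, justified=frozenset()):
    """Return names that match the type-suffix pattern and carry no justification."""
    return [n for n in names if TYPE_SUFFIX.match(n) and n not in justified]


# Hypothetical PRELUDE contents, for illustration only.
prelude = ["print", "print_int", "print_str", "len", "to_str"]
hits = audit_prelude(prelude, justified={"to_str"})
print(hits)
```

Keeping the allowlist explicit means every intentional type-suffix name is recorded in one reviewable place, while anything unlisted still surfaces as an audit hit.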

## Resolution path

1. **Identify**: grep PRELUDE source-face table for `<verb>_<type>` names.
2. **Classify**: for each hit, determine whether it is source-face intentional
(user writes it) or codegen-internal (should be hidden behind a polymorphic
dispatch).
3. **Cleanup sprint**: remove the monomorphic names from PRELUDE; add polymorphic
dispatch that routes `print(x: T)` to `__cobrust_print_T` post-typecheck.
4. **Mechanical refactor**: batch-rename all call sites (mirrors LC-100 &borrow
226-site batch pattern — treat as a mechanical sprint, not a semantic one).
5. **Gate**: add CI lint to prevent re-introduction.
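
Step 3 can be illustrated with a small lookup from the checked argument type to the internal symbol. This is a sketch only: the symbol names come from the table in this finding, while the type-name strings, the `PRINT_SYMBOLS` table, and the `monomorphize_print` helper are hypothetical and do not describe the actual `synth_call` implementation.

```python
# Hypothetical mapping from a checked argument type to the internal C-ABI symbol.
# Symbol names follow the table in this finding; the type-name keys are assumed.
PRINT_SYMBOLS = {
    "int": "__cobrust_print_int",
    "str": "__cobrust_print_str",
    "bool": "__cobrust_print_bool",
    "float": "__cobrust_print_float",
}


def monomorphize_print(arg_type: str) -> str:
    """Resolve a source-level print(x) call to its internal symbol after typecheck."""
    try:
        return PRINT_SYMBOLS[arg_type]
    except KeyError:
        raise TypeError(f"print() has no lowering for type '{arg_type}'") from None


print(monomorphize_print("int"))
```

The point of the design is that the type-suffixed names exist only on the right-hand side of the table, so they never need to appear in user-written source.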

## Related findings

| Finding | Relationship |
|---------|--------------|
| F36 — fixture-name-vs-behavior drift (Cobrust F36) | Same family: wave-1 demo-ware fossilizes without audit checkpoint |
| F37 — silent-rot-on-accepted-debt (Cobrust F37) | Same family: accepted debt silently accumulates usage; no discipline at debt boundary |
| F1 — Declared rules without enforcement | Parent family: "design surface should be polymorphic" is common sense; no enforcement gate exists at PRELUDE authorship time |
@@ -0,0 +1,141 @@
---
catalogue_id: F42
title: "Device-identifying names leaked into git history and repo files via sub-agent memory read-through"
family: F1-Sediment (opsec-boundary sub-form)
severity: P1 (privacy / opsec — identifying info in public repo)
status: ratified_2026-05-19
empirical_project: Cobrust pre-publish privacy sweep (2026-05-19)
cobrust_local_id: F39 (f39-device-name-leakage-in-commits.md)
date_ratified: 2026-05-19
cobrust_sha: d012df9
resolution: git filter-repo force-rewrite + rename + CI grep gate (Option A)
discovered_by: P10 CTO emergency audit — pre-publish privacy sweep
---

# F42 — Device-identifying names leaked into public artifacts via sub-agent memory read-through

## Pattern

Sub-agents writing commit messages, ADRs, and module documentation frequently embed
**device-identifying strings** sourced from operator memory references (hostnames,
IP addresses, SSH port numbers, GPU model SKUs, OS kernel versions, user login names)
into public-repo artifacts that land on `main`. Before publication, this exposes details
of the operator's infrastructure in a soon-public repository.

The mechanism is a **memory read-through without opsec boundary**:

1. Operator stores concrete connection info in agent memory (e.g., `reference_x86_workstation.md`)
so they can reconnect quickly between sessions.
2. Sub-agents reading that memory treat the literals as **publishable grounding detail**
(it "contextualizes" the work) rather than **opsec-sensitive material**.
3. No pre-write rule prohibits embedding these strings. CI does not grep commit/diff text
for banned patterns.
4. Strings accumulate in commit messages (not trivially rewriteable in a normal git flow),
ADRs, workflow files, and architecture pages over many sprint sessions.

## Root cause

This is F1-family: the rule "don't embed infrastructure literals in publishable text"
exists as common sense, but no enforcement gate verifies it at write time or CI time.

Two independent contributing factors:

- **Memory-to-artifact boundary ambiguity**: agents correctly use memory to orient
themselves. The distinction "this literal is ops-private" vs. "this literal is
publishable" is not enforced at the tool boundary. Any memory read can silently
propagate private literals into any subsequent write.
- **Commit message irreversibility**: file contents can be edited in place; commit
messages require history rewrite. The longer the leak persists, the more invasive
the remediation (force-push, filter-repo, coordinated branch cleanup).

## Empirical evidence (Cobrust 2026-05-19, pre-rewrite)

**Quantified leak inventory:**
- **31 commit messages** across `main` + feature branches contained one or more of:
`DG-Workstation-2x3090`, `wubingjing`, `112.74.60.44`, `port 10040`, `Linux 6.x kernel`.
- **18 repo files** carried the same strings inline:
- 8 ADRs
- 2 architecture pages
- 4 test files
- 1 module documentation page
- 1 spike document
- 1 GitHub Actions workflow file
- **Workflow filename** `.github/workflows/workstation-gates.yml` hinted at
the host identity tier.

**Remediation executed (Cobrust 2026-05-19):**
- `git filter-repo --replace-text` + `--replace-message` rewrote all branches,
mapping device-identifying strings to neutral placeholders:
- hostname → `<self-hosted-runner>`
- user login → `<runner-user>`
- IP address → `<runner-ip>`
- SSH port → `<runner-port>`
- GPU model SKU → `<gpu-host>`
- OS kernel version → `linux x86_64 host`
- 18 leftover worktree branches deleted (local + remote).
- Workflow renamed to `.github/workflows/self-hosted-gates.yml`.
- Force-pushed `main` with rewritten history (solo dev, no external consumers,
operator explicit authorization).
- Ratified at commit `d012df9`.

## Detection rule (CI gate — open as of ratification)

Add a pre-commit / CI grep gate that fails the build if any banned literal reappears:

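```bash
#!/usr/bin/env bash
# Pre-commit hook body (or a script step invoked from .github/workflows/opsec-lint.yml)
BANNED_PATTERNS=(
  "DG-Workstation"                    # specific host class name
  "wubingjing"                        # specific user login
  "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"    # any IPv4 (catch-all; also hits four-part version strings)
  "port [0-9]{4,5}"                   # explicit SSH port references
  "RTX [0-9]{4}"                      # GPU model SKU
  "Linux [0-9]+\.[0-9x]+"             # kernel major.minor version, including the "6.x" form
)
# Inspect only added lines, so that removing a banned literal does not fail the gate.
added_lines=$(git diff --cached -U0 | grep -E '^\+' | grep -vE '^\+\+\+ ')
for pattern in "${BANNED_PATTERNS[@]}"; do
  if grep -qE -e "$pattern" <<<"$added_lines"; then
    echo "OPSEC LINT FAIL: banned pattern '$pattern' in staged diff"
    exit 1
  fi
done
```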

Apply to commit messages via `commit-msg` hook as well as file content via `pre-commit`.
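
For the commit-message side, a `commit-msg` hook receives the path of the file holding the proposed message as its only argument. A minimal sketch of such a hook follows; the pattern list is illustrative, the helper names are hypothetical, and a real hook would presumably share one pattern list with the CI gate.

```python
#!/usr/bin/env python3
import re
import sys

# Illustrative patterns only; not the full banned list from the gate above.
BANNED_PATTERNS = [
    r"\b(?:\d{1,3}\.){3}\d{1,3}\b",  # IPv4-shaped literals
    r"port \d{4,5}",                 # explicit port references
    r"RTX \d{4}",                    # GPU model SKU
]


def find_banned(message: str) -> list[str]:
    """Return every banned pattern that matches somewhere in the message."""
    return [p for p in BANNED_PATTERNS if re.search(p, message)]


def main(argv: list[str]) -> int:
    # Git passes the path of the file holding the proposed message as argv[1].
    with open(argv[1], encoding="utf-8") as fh:
        hits = find_banned(fh.read())
    for pattern in hits:
        print(f"OPSEC LINT FAIL: banned pattern '{pattern}' in commit message",
              file=sys.stderr)
    return 1 if hits else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Saved as `.git/hooks/commit-msg` and made executable, a non-zero exit aborts the commit before the message enters history, which is the cheapest point at which to stop a message-level leak.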

## Going-forward rule

When writing commit messages, ADRs, module docs, or any other publishable artifact,
**never** embed:

- Specific hostnames (use `<self-hosted-runner>` or `runner host`).
- Specific user logins (use `<runner-user>` or `the operator account`).
- IP addresses (use `<runner-ip>` or `the runner endpoint`).
- SSH port numbers (use `<runner-port>` or `the SSH port`).
- GPU model SKUs as tier identifiers (use `<gpu-host>` or describe capability: "x86_64 GPU host with CUDA").
- OS minor version + kernel version (use `linux x86_64 host`).

Initials-only references (e.g., "DG verify", "on DG") are acceptable when the
two-letter token does not uniquely identify a public-facing artifact.

## Resolution path

If the leak has already accumulated:

1. **Audit**: `git log --all -p | grep -E "<pattern>"` to quantify the blast radius
across all branches, covering both commit messages and file contents.
2. **Triage**: separate file-content leaks (patchable in place) from commit-message leaks
(require filter-repo rewrite).
3. **Rewrite**: `git filter-repo --replace-text replacements.txt --replace-message replacements.txt`
where `replacements.txt` maps each banned literal to its neutral placeholder.
4. **Branch cleanup**: delete worktree branches that carried unrewritten history.
5. **Gate**: add CI opsec lint as described in the Detection Rule above.
6. **Memory cross-link**: add an in-repo finding file so future agents resuming without the
operator's memory entry still have the rule available.
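
The `replacements.txt` file in step 3 can be generated from a single mapping so that the text and message rewrites stay in sync. This sketch assumes the `old==>new` expression format that git filter-repo documents for `--replace-text`, where each line is treated as literal text by default; the literals below are hypothetical stand-ins, not the strings from this incident.

```python
# Hypothetical literals; substitute the strings found during the audit step.
REPLACEMENTS = {
    "example-host-01": "<self-hosted-runner>",
    "exampleuser": "<runner-user>",
    "192.0.2.10": "<runner-ip>",
}


def render_replacements(mapping: dict[str, str]) -> str:
    """Render one 'old==>new' expression per line."""
    return "".join(f"{old}==>{new}\n" for old, new in mapping.items())


content = render_replacements(REPLACEMENTS)
print(content, end="")
```

Writing `content` to `replacements.txt` outside the repository (or excluding it from commits) avoids reintroducing the very literals the rewrite is meant to remove.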

## Related findings

| Finding | Relationship |
|---------|--------------|
| F1 — Declared rules without enforcement | Parent family: opsec boundary exists as common sense, but no enforcement gate at write time |
| F43 — SPOF heavy-build host (Cobrust F40) | Same origin: over-reliance on a named private host created both the opsec exposure (F42) and the availability failure (F43) |
| F35 — commit-message scope drift (upstream catalogue) | Adjacent: commit messages carry unintended context; this finding is the opsec variant |