Skip to content

Make MCP server registration non-destructive and format-preserving.…#386

Merged
inkeep-oss-sync[bot] merged 1 commit into
mainfrom
copybara/sync
Jun 30, 2026
Merged

Make MCP server registration non-destructive and format-preserving.…#386
inkeep-oss-sync[bot] merged 1 commit into
mainfrom
copybara/sync

Conversation

@inkeep-oss-sync

Copy link
Copy Markdown
Contributor

Make MCP server registration non-destructive and format-preserving. ok init and the desktop reclaim sweeps now only ever add OpenKnowledge's own entry to a harness config: comments, formatting, key order, value types, and byte-level encoding (BOM, line endings, trailing newline) are preserved. A config that can't be parsed safely — invalid JSON/TOML, a duplicate server block, or an oversized file — is left byte-for-byte untouched and reported as left unchanged (<reason>) rather than being reset or rewritten. Codex config.toml is edited through a new napi-rs toml_edit addon so 64-bit integers and microsecond datetimes are no longer mis-flagged, and symlinked configs are written through to their real target instead of being replaced.

* docs(open-knowledge): spec for harness MCP-config write safety

Plan to fix the codex-config-trash data-loss bug across all five coding
harnesses (Claude Code, Claude Desktop, Cursor, OpenCode, Codex). OK's
weak parsers (smol-toml, JSON.parse) throw on valid configs, and the
every-launch repair sweep then resets the file; lossy round-trips also
strip comments and retype values.

The spec covers a napi-rs Rust addon porting Codex's toml_edit config
editor (insert-only, format-preserving) for Codex TOML, a surgical
jsonc-parser editor for the four JSON harnesses, a non-destructive
classify/repair path, symlink write-through, and BOM/EOL fidelity.
Grounded by reproductions plus eleven evidence files.

Implementation lands in follow-up commits on this branch.

* chore(md-conformance): refresh ng-anchors catalog file counts

Pre-existing baseline drift, independent of any feature work. The committed ng-anchors-catalog.json recorded productionFileCount 83 and testFileCount 179, but the deterministic enumeration over packages/core/src/markdown, packages/core/src/extensions, and the test globs now counts 84 and 180. A source/test file was added without regenerating the catalog, leaving the freshness gate in bun run check red on the baseline. Regenerated via enumerate-ng-anchors; only the count metadata changed, no anchors or slugs.

* [US-001] non-destructive MCP-config classification: split corrupt into absent + decline

classifyExistingMcpEntry no longer collapses blank and unparseable configs into one 'corrupt' kind. Blank/whitespace/0-byte files now classify as 'absent' (safe to create into); a present non-empty config OK cannot parse or read classifies as 'decline' carrying a bounded reason enum (McpDeclineReason), never the raw parser message. A half-written file read mid-write by a concurrent harness classifies as decline, not creatable-blank.

The McpEntryClassification union keeps four kinds (absent, no-entry, present, decline). The three classification consumers (claude-readiness, mcp-wiring, project-mcp-reclaim) move to the decline kind; their reclaim disposition is unchanged here and becomes non-destructive in the next story.

Tests: blank/whitespace to absent, invalid JSON/TOML to decline with a bounded reason, truncated JSON/TOML to decline, and a declined config is left byte-unchanged by classify.

* [US-002] non-destructive desktop MCP reclaim: delete .broken rename

Both desktop reclaim sweeps now leave a present, unparseable harness config byte-untouched and skip registering, emitting a bounded mcp-config-decline signal {scope, surface, editorId, reason} instead of renaming it to .broken-<ts> and writing a fresh near-empty config.

mcp-wiring.ts startup sweep drops the rename + fresh-write branch and the corruptBackupFailures/now/fs state. project-mcp-reclaim.ts drops moveCorruptAside, the backup DI, and the reclaimed-from-corrupt status. Both push a decline disposition and emit the signal. The bounded decline-event builder is wired and exported from index.ts (McpConfigDeclineInput kept module-local).

A reset-repro regression (real fs + real classifyExistingMcpEntry on an i64 TOML config) asserts the file stays byte-unchanged with no .broken sidecar and no write dispatched. The project-reclaim decline tests are updated to the non-destructive behavior and an AC3 create-into-blank test is added. repair-mcp-configs.ts was already non-destructive and is unchanged.

* [US-003] capable JSONC parse on the read/classify path

Swap the JSON read/classify path from JSON.parse to jsonc-parser so valid
harness configs are no longer false-flagged unparseable and reset:

- readJsonConfig and classifyExistingMcpEntry now parse via a tolerant JSONC
  scanner (line and block comments, trailing commas) and forgive the lone
  InvalidSymbol a leading UTF-8 BOM produces, so a comment-rich or BOM-prefixed
  config classifies by its real content.
- classify surfaces a duplicate top-level container key as a decline
  (duplicate-container) instead of silently editing whichever block the value
  parse happens to keep. The check is keyed to each harness's real container
  key, so a repeated unrelated sibling key stays no-entry.
- The JSON branch parses the bytes classify already read (no second read) and
  shares classifyContainer with the TOML branch; TOML keeps its current parser.
- Add jsonc-parser to tsdown alwaysBundle so the bundled CLI resolves it inside
  the packaged app.

Add classify unit tests for JSONC comments, trailing commas, leading BOM,
duplicate container (cursor and opencode), and the unrelated-sibling-key case.

* [US-004] surgical comment-preserving JSON/JSONC MCP write editor

Replace the whole-file JSON.stringify spread-merge in the write spine with a
jsonc-parser modify/applyEdits upsert that touches only OK's own entry, so a
registration write preserves the user's comments, formatting, key order, and a
leading BOM in their harness config instead of reflowing the whole document.

- upsertJsonMcpConfig: absent/blank create fresh; present+parseable upsert in
  place; oversize (>10 MiB), duplicate container key, and unparseable all
  decline and leave the file byte-unchanged. Idempotent: an already-current
  entry short-circuits to a byte-identical no-op.
- New 'declined' EditorMcpResult action carrying a bounded decline reason, so a
  guest-ownership leave-untouched is neither a write nor a failure. Threaded
  through formatInitResult, the desktop wiring mirror + logging, and the CLI
  repair outcome.
- Pre-write backup of the original bytes on a present, parseable write; size
  gate runs before the parse+modify so a multi-MB file costs only a read.
- Resolve jsonc-parser to its ESM entry in tsdown so its submodule loads inline
  in the bundle. Its UMD main left bare require('./impl/*') calls that crashed
  the desktop-loaded dist; a bundled-load integration test guards the fix.

TOML keeps the whole-file path for now; the format-preserving native write
lands separately.

* [US-005] napi-rs native-config crate + capable TOML classify + load-probe fallback

Add packages/native-config, a Rust/napi addon wrapping toml_edit, and wire its
capable parse into the harness-config classify path so a valid Codex config
carrying a 64-bit integer or a microsecond datetime is no longer mis-flagged
unparseable (the enabler of the reported config reset). A synchronous
load-probe factory returns the native engine when the .node loads and a JS
smol-toml fallback otherwise; on the fallback the write spine declines a
present non-trivial TOML config rather than risk a lossy whole-file rewrite,
and only creates into an absent or blank file.

Crate: pure N-API-free parse/JSON-projection module (accepts i64 past 2^53 and
microsecond datetimes, errors on genuinely-malformed input) with 6 cargo tests,
a thin napi 3 wrapper, and a turbo test task that runs cargo test so the gate
covers it. Engine: src/native/toml-config-engine.ts probes and falls back the
way createTokenStore does, with a loader parameter for deterministic tests.

The classify TOML branch now reads through the engine using the bytes already
read. Updated the reset-repro regression: under the capable parser the i64
config classifies as no-entry, so the no-create sweep leaves it untouched
rather than declining, and it is still byte-unchanged with no .broken sidecar.

Packaging: cli workspace dependency, tsdown neverBundle, biome and knip ignores
for the generated loader, notices regen. The packaged-desktop extraResources
copy and the cross-platform prebuild matrix are tracked as their own work; the
packaged fallback stays a non-destructive decline until then.

* [US-006] insert-only single-key toml_edit upsert/remove engine in native-config

* [US-007] port Codex edit-test conformance floor + symlink resolution + Codex attribution

* [US-008] wire Codex TOML write through the native addon + BOM/EOL/trailing-newline wrapper

Route writeEditorMcpConfig's TOML branch through upsertTomlMcpConfig: the
native toml_edit insert-only upsert touches only OK's own entry, and a thin
wrapper captures and re-applies the source file's leading BOM, dominant EOL,
and trailing-newline state (toml_edit strips the BOM, normalizes CRLF to LF,
and always emits a trailing newline). A present, parseable config gets a
pre-write backup before the atomic write; an unchanged config is a
byte-identical no-op. On the JS fallback a present non-blank config declines
rather than a lossy whole-file rewrite.

The engine becomes a discriminated union on backend so only the native arm
carries upsertEntry; the fallback can only parse. OK's own string values now
serialize as single-line escaped basic strings (matching the prior smol-toml
writer) so every newline in the document is structural, which makes the
blanket CRLF re-apply provably safe and keeps OK's resolver chain LF-internal
on any line-ending convention. The native upsert reports whether OK's entry
already existed so the spine labels register-vs-update without re-parsing.

* [US-009] symlink write-through on the harness write path

Resolve a (possibly symlinked) harness config to its real write target so a
dotfile-managed (stow/chezmoi) config is written through to its target instead
of being replaced by a regular file and orphaning the dotfiles-repo copy.

New cli/src/native/symlink-resolve.ts: resolveHarnessWritePaths prefers the
native conformance-tested resolver and degrades to a pure-JS mirror of Codex's
resolve_symlink_write_paths. Symlink traversal has no capability gap, so the JS
mirror keeps write-through working for all five harnesses and on platforms with
no prebuilt binary. writeEditorMcpConfig resolves inside the file lock and
writes to the resolved target; a cyclic chain collapses to a regular file,
breaking the link. Atomicity and the cross-process lock are unchanged; core
stays native-free.

Tests run one contract suite against both the JS and native resolvers, and a
real-spine write-through regression asserts the symlink survives and the real
target received the entry (JSON and TOML), plus cycle-break.

* [US-010] package native-config addon into the desktop bundle

The Codex TOML harness write goes through the native-config toml_edit addon
from two runtime consumers in a packaged build: the bundled CLI subprocess
and the desktop main process (the MCP repair sweep calls writeUserMcpConfigs
in-process). Mirror the @napi-rs/keyring packaging so both can resolve it.

- Declare @inkeep/open-knowledge-native-config as a desktop dependency so
  electron-builder includes it in app.asar for the in-process main consumer.
- extraResources copies the addon onto the bundled CLI's own resolution path
  (cli/node_modules), filtered to the napi loader, types, manifest, and the
  platform .node so the Rust target/ tree never ships. Unlike keyring this is
  one workspace package whose binary sits beside its loader.
- asarUnpack the addon package dir so the loader and the .node it requires
  stay co-located on the real filesystem.
- Extend the forcing-function packaging test: the addon now ships, so it
  leaves KNOWN_UNCOVERED; add guards that the extraResources rule ships both
  loader and binary, that asarUnpack covers it, and that the build artifacts
  exist after a build.

When the binary cannot ship or load the TOML write degrades to a
non-destructive decline, never a lossy write.

This is the packaging sub-piece of US-010. The cross-platform prebuild matrix
(release CI) and the packaged-DMG smoke remain, both gated.

* [US-011] consolidated cross-harness FR1 acceptance matrix + surgical-text counterfactual guard

* [US-010] fix knip dependency tracing for the native addon

Remove the now-stale cli ignoreDependencies entry for
@inkeep/open-knowledge-native-config (knip traces it via the native
binding test requires), and add a desktop ignoreDependencies entry (the
desktop main reaches the engine through the bundled CLI's require factory
at runtime, so there is no static specifier for knip to follow, but the
dependency is declared for the electron-builder extraResources copy).

* [US-010] bundle native-config prebuilt binaries into the CLI + prebuild CI

Ship the toml_edit addon to all platforms via Option 3 (bundle the
prebuilt binaries into the CLI tarball; native-config stays private, no
per-platform optional-dependency packages):

- prebuild CI: a standalone 8-target napi-rs matrix workflow builds one
  .node per target (darwin arm64/x64, linux gnu/musl arm64/x64, windows
  msvc arm64/x64) and a combine job asserts all 8 are present
- CLI build copies the napi loader + every present .node into
  dist/native/ (shipped via files:[dist]); a CommonJS package.json beside
  the loader keeps Node from mis-parsing it as ESM under the CLI's
  type:module tree
- a shared resolver tries the bundled dist/native loader first, then the
  workspace package, then the existing non-destructive fallback
- THIRD_PARTY_NOTICES gains a maintained, Cargo.lock-checked attribution
  section for the addon's statically linked Rust crates
- AC5 committable smoke: a verifier that loads the bundled addon from the
  packaged layout and round-trips it, a driver test, and a runbook (the
  live in-Electron-process step is deferred to a signed-DMG QA run)

* [US-010] wire native-config prebuilt binaries into the npm release

Add a resilient release.yml step that stages the latest successful
native-config-prebuild bundle into packages/native-config/ before the
CLI build, so the published npm CLI carries all eight platform binaries
rather than only the release runner's host target. Warn-not-fail: a
missing bundle ships host-only and non-host platforms use the
non-destructive smol-toml fallback, so it can never make a release
worse than the host-only status quo. Binaries are ABI-stable; the
latest run's set is reused until native-config changes.

Also strip two branch-introduced process markers from source comments
(comment-discipline): the decision reference in the Rust conformance
test and the story reference in the smoke driver header.

* fixup! local-review: baseline (pre-review state)

* fixup! local-review: baseline (pre-review state)

* fixup! local-review: address findings (pass 1)

* fix(open-knowledge): native-config to devDependencies (npm install fix)

The published CLI listed the private, never-published
@inkeep/open-knowledge-native-config in runtime dependencies. At publish,
workspace:* rewrites to the on-disk 0.0.0, so npm install @inkeep/open-knowledge
would try to resolve that from the registry, 404, and abort for every user. Move
it to devDependencies alongside the bundled siblings -app/-core/-server; the
addon still ships in dist/native/ and loads dist-relative, and only tests
bare-require it. Validated in reports/bundled-native-addon-dep-placement.

Also append post-ship corrigendum breadcrumbs to SPEC D14/AC5 and the
scope/failure-mode/non-goal references: the additive-path pre-write backup was
removed entirely. It was write-only (nothing reads .ok-backup, so no recovery)
and a secret-disclosure footprint; EC5 concurrent-truncation is covered by
D13/AC7 truncation-to-decline plus the atomic tmp+rename.

* feat(open-knowledge): pin fallback non-destruction + stable-release binary gate

Four changes hardening MCP-config write safety and the native-config release path:

- mcp-wiring.test.ts: add a sibling regression test in the reset-repro block
  that forces the JS smol-toml fallback engine (the path every off-macOS /
  no-prebuilt-binary user runs) and proves the boot sweep leaves a valid i64
  Codex config byte-identical: no fresh write, no .broken-* sidecar, only a
  bounded mcp-config-decline signal. Drives the real classify through the cli
  surface; imports classify from cli source so the engine override binds to the
  same instance the classify reads.

- release.yml: hard-fail a stable @latest cut when fewer than 8 native-config
  binaries stage or the prebuild run is missing, while beta keeps the resilient
  warn-and-proceed path. A stable publish silently shipping host-only native
  binaries is a fidelity regression. The staged count is written to the job
  summary either way.

- release.yml: verify the staged prebuild run's head commit is an ancestor of
  the release commit before trusting its binaries, so the native code shipped on
  every CLI invocation was built from source actually in the release. A
  non-ancestor degrades like the missing-binaries case (warn for beta, hard-fail
  for stable).

- native-config/src/lib.rs: note that remove_mcp_server is exposed for the
  future ok uninstall flow so it is not re-flagged as dead surface; it stays as
  the symmetric counterpart to upsert_mcp_server.

* chore(open-knowledge): regenerate md-audit byte-audit after rebase onto main

* ci(open-knowledge): use bun-install-ci.sh in native-config prebuild (bunfig cooldown)

* ci(open-knowledge): gate native-config cargo conformance tests in OK validation

native-config's 40 toml_edit conformance tests (test = cargo test) ran in
no required CI job. The per-package test loop in the open-knowledge / test
matrix cell never listed it, and native-config-prebuild.yml only builds the
addon, so the fidelity core of the MCP-config write-safety work was unguarded.
check:ok-ci-test-coverage flagged the gap.

Gate it rather than allowlist it away: add the package to the per-package
loop with a 360s budget sized for a cold cargo compile, and add a test-gated
dtolnay/rust-toolchain step plus a cargo cache so the test cell has a toolchain
(the pre-build napi build and cargo test both need cargo). The action SHA and
toolchain match release.yml and native-config-prebuild.yml.

* fix(open-knowledge): gate classify path on config size + review hardening

Close the classify-path oversize asymmetry. classifyExistingMcpEntry now
stat-gates a config past JSON_CONFIG_MAX_BYTES before reading, mirroring the
write path's oversize decline, so a history-bloated ~/.claude.json (tens of MB)
is left untouched without a full read and JSONC parse on every launch. It
returns the existing decline/oversize classification, so no caller changes are
needed (no caller branches on the reason value, and the decline telemetry event
already accepts oversize). statSync avoids pulling the file into memory just to
measure it. A RED test pins that the gate fires ahead of the parse: a valid
JSON payload that would otherwise classify no-entry now declines oversize and is
left byte-unchanged.

Also addresses the code review findings on this PR:
- cargo cache key now includes hashFiles(Cargo.lock, Cargo.toml) plus
  restore-keys in native-config-prebuild.yml and the OK validation workflow, so
  a dependency bump no longer reuses a stale crate cache
- init.ts native upsert and config-read catches emit debugNativeLoadFailure so a
  present-but-broken native addon is diagnosable under OK_DEBUG_NATIVE instead of
  collapsing silently to unparseable
- mcp-decline-event surface is narrowed from open string to a bounded union so
  the decline event stays bounded-cardinality at the type level
- symlink-resolve native catch and the readlinkSync catch emit debug traces

* fix(open-knowledge): honest fallback decline reason + CRLF JSON write test

Two re-review findings, both in this PR's own code:

- The TOML write path's JS-fallback branch declined a present config with reason
  'unparseable', but it never parses the config: the real cause is that the
  format-preserving native engine is unavailable and a fallback rewrite would be
  lossy. Add a distinct 'no-native-writer' McpDeclineReason and emit it there, so
  decline telemetry is no longer misattributed to a parse failure.
- Add CRLF coverage to the JSON surgical-write path (the TOML path already had
  it). A uniformly-CRLF config keeps its line endings byte-for-byte and the
  inserted entry is written as CRLF too, with no bare LF leaking in.

* docs(open-knowledge): ok:macos-vm CLI-on-VM validation reference + golden gotchas

Add an ok:macos-vm reference (references/cli-on-vm-validation.md) for the
build-free CLI validation path: npm pack + rsyncToVM + npm i -g, where the
prebuilt arm64 native-config .node matches the golden so the napi config
writer loads with no build-ok-desktop and no Rust toolchain. Document the
consumer-side delta (ok init writes the open-knowledge MCP entry; the general
consumer-fidelity method lives in eng:qa, loaded via /qa).

Record golden substrate gotchas in macos-vm + bake-lume-golden failure modes
(stale npm _cacache, no Rust, no timeout binary, ok init needs git), and note
inject.sh does not cover the Claude Code CLI.

* docs(open-knowledge): fix stale classify comment + CRLF JSDoc

* chore(open-knowledge): add Rust toolchain to lume golden 2026-06-28 (#2244)

* chore(open-knowledge): add Rust toolchain to lume golden 2026-06-28

Bake a Rust toolchain into the lume macOS golden so build-ok-desktop
builds the native-config napi addon in-VM, retiring the per-fork rustup
install and the build-free-only workaround.

- layer 13d: rustup + cargo + the aarch64-apple-darwin target, shimmed
  into /usr/local/bin so non-login ssh finds it
- layer 13e: sudo-purge the inherited root-owned ~/.npm/_cacache (the
  EEXIST/EACCES that broke a fork's npm i -g) and chown ~/.npm to lume
- verify.sh: check 13 asserts cargo + rustc + the aarch64 target
- promote DEFAULT_GOLDEN to ok-tahoe-base-substrate-only-2026-06-28
  (fork-extend off 06-06); update lineage in the 3 allowlisted docs
- retire the now-false "No Rust toolchain" notes
- macos-vm skill: add an Observability section (headless vs VNC, the
  password rationale, the shutdown-hang caveat)

Validated on a fresh fork: verify.sh 16/0 (Aqua healthy, Rust present)
plus a cargo build of toml_edit, native-config's core dep, in 3.98s.

* docs(open-knowledge): propagate Rust + 13e to secondary doc references

* fix(open-knowledge): exclude declined writes from startup-repair status

* test(open-knowledge): add Codex MCP tool-call QA scenario (cx-s1)

Adds a lume-QA scenario proving Codex.app invokes a real OpenKnowledge MCP
tool, asserted by the load-bearing mcp-tool-trace rung. It is the Codex
analogue of cd-s1-claude and the first scenario showing an agent, not a bare
Node client, fire a tools/call to OK.

- cx-s1-codex.ts: seeds a project, wires ~/.codex/config.toml through the
  MCP-trace proxy (published bunx server, approval_policy=never so the tool
  call is not gated behind an approval card), drives Codex over CDP, and
  asserts a tools/call frame in the trace.
- lume-qa-cx-s1-codex.e2e.ts: host wrapper (fork golden, inject auth, install
  trace proxy, run, evaluate rungs).
- codex-cdp-evals.ts: shared Codex CDP-driving primitives (wizard dismiss,
  fill and submit, chat scrape) plus a new tool-approval handler.
- ok-e2e-codex.ts: migrated to import the shared evals, removing the
  duplicated inline copies.

Manual-only (OK_LUME_QA=1 bun run lume-qa cx-s1-codex); not in CI. Validated
on the 2026-06-28 golden: 3 of 3 pass with approval_policy=never (Codex
gpt-5.4 agent calling the OK config tool).

---------

GitOrigin-RevId: afa186ece2dfb2e8384973e7db52e64e3088ec35
@inkeep-oss-sync inkeep-oss-sync Bot requested a review from nick-inkeep as a code owner June 30, 2026 06:16

@inkeep-internal-ci inkeep-internal-ci Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval from agents-private public-mirror-sync (run: https://github.com/inkeep/agents-private/actions/runs/28424330492). Source of truth is the monorepo; direct edits on inkeep/open-knowledge are overwritten on next sync.

@inkeep-oss-sync inkeep-oss-sync Bot merged commit 66aa227 into main Jun 30, 2026
@inkeep-oss-sync inkeep-oss-sync Bot deleted the copybara/sync branch June 30, 2026 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant