fix(search): preserve malformed names (no collapse) + --normalize-malformed switch#488

Merged: githubrobbi merged 3 commits into main from fix/malformed-name-display on Jun 28, 2026

@githubrobbi

Collaborator

Preserve malformed-named entries + --normalize-malformed switch

Fixes the G-drive parity mismatch and makes corrupt (ill-formed UTF-16) NTFS
names first-class in output. Three reviewable commits:

1. Don't collapse malformed paths (the bug)

A directory whose name is an unpaired UTF-16 surrogate was stored byte-faithfully
and flagged malformed, but the lossy name() accessor returned "" for it, so
the path resolver pushed an empty segment — …\evil�.exe\report.txt collapsed to
…\report.txt, re-parenting children to the volume root and emitting duplicate
parent rows. New CompactRecord::name_display() renders ill-formed names lossily
and both resolvers preserve the segment (and resolve_path_inner now terminates
the parent walk on the lossless bytes so a crooked dir can't truncate the path
beneath it).
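
A minimal sketch of the collapse and the fix, under stated assumptions: the `Record` layout, the self-parented root with an empty name, and the hard-coded `C:` prefix are hypothetical stand-ins, not the real `CompactRecord` or resolver. `String::from_utf8_lossy` stands in for the lossy render, so this sketch shows one marker per byte.

```rust
use std::borrow::Cow;

/// Toy stand-in for a compact record (hypothetical layout).
struct Record {
    /// Index of the parent record; the root points at itself.
    parent: usize,
    /// Name bytes stored losslessly (WTF-8); empty only for the root.
    raw: Vec<u8>,
}

impl Record {
    /// Old-style accessor: an ill-formed name comes back as "".
    fn name(&self) -> &str {
        std::str::from_utf8(&self.raw).unwrap_or("")
    }

    /// Display accessor: borrows valid names, renders ill-formed ones lossily.
    fn name_display(&self) -> Cow<'_, str> {
        String::from_utf8_lossy(&self.raw)
    }
}

/// Walk parent links from `idx` up to the root and join the segments.
/// With `fixed == false` the walk stops on an empty *lossy* name (the bug);
/// with `fixed == true` it stops only when the lossless bytes are empty.
fn resolve_path(records: &[Record], mut idx: usize, fixed: bool) -> String {
    let mut segments: Vec<Cow<'_, str>> = Vec::new();
    loop {
        let rec = &records[idx];
        let at_root = if fixed {
            rec.raw.is_empty()
        } else {
            rec.name().is_empty()
        };
        if at_root {
            break;
        }
        segments.push(if fixed {
            rec.name_display()
        } else {
            Cow::Borrowed(rec.name())
        });
        idx = rec.parent;
    }
    segments.push(Cow::Borrowed("C:"));
    segments.reverse();
    segments.join("\\")
}
```

With a root, a directory named `evil` + an unpaired surrogate + `.exe`, and a `report.txt` inside it, the unfixed walk stops at the crooked directory and yields `C:\report.txt`, while the fixed walk keeps the segment.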

2. One marker per code unit (match C++)

from_utf8_lossy emits one � per byte, so a single bad code unit shows as ���,
where the reference C++ tool, Everything, and file managers show one. The new
renderer walks the WTF-8 and replaces each offending code unit with one marker,
so the default output now matches C++ exactly (which is what lets the G-drive
malformed rows reconcile on their own).
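
The per-code-unit walk could look roughly like the sketch below. It assumes the stored bytes are WTF-8, where an unpaired surrogate is the 3-byte sequence `ED A0..BF 80..BF`; the function names are illustrative and not taken from the PR.

```rust
use std::borrow::Cow;

/// True if `b` starts with the 3-byte WTF-8 encoding of a surrogate
/// code unit (U+D800..=U+DFFF).
fn is_surrogate_seq(b: &[u8]) -> bool {
    b.len() >= 3
        && b[0] == 0xED
        && (0xA0..=0xBF).contains(&b[1])
        && (0x80..=0xBF).contains(&b[2])
}

/// Render WTF-8 bytes with one U+FFFD per offending code unit.
fn render_lossy_per_unit(wtf8: &[u8]) -> Cow<'_, str> {
    // Fast path: well-formed names borrow and never allocate.
    if let Ok(s) = std::str::from_utf8(wtf8) {
        return Cow::Borrowed(s);
    }
    let mut out = String::with_capacity(wtf8.len());
    let mut rest = wtf8;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(s) => {
                out.push_str(s);
                break;
            }
            Err(e) => {
                let (valid, bad) = rest.split_at(e.valid_up_to());
                // The prefix was just validated, so this cannot fail.
                out.push_str(std::str::from_utf8(valid).unwrap());
                out.push('\u{FFFD}');
                // Skip a whole surrogate encoding, or one stray byte otherwise.
                let skip = if is_surrogate_seq(bad) { 3 } else { 1 };
                rest = &bad[skip..];
            }
        }
    }
    Cow::Owned(out)
}
```

For the bytes `b"evil\xED\xA0\x80.exe"`, `String::from_utf8_lossy` yields three markers, while `render_lossy_per_unit` yields a single one.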

3. --normalize-malformed (the switch)

Opt into greppable, reversible <BAD:HHHH> markers (HHHH = the bad code unit,
e.g. <BAD:DCFF>) instead of �, for downstream tooling that needs to spot,
parse, or round-trip corrupt entries by path string. < / > are invalid in
NTFS names so the marker can't collide; the hex keeps siblings distinct and is
reversible. Threaded end to end: CLI flag → SearchParams (backward-compatible
wire) → daemon → SearchFilters → a MalformedRender param on resolve_path*.
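
To illustrate why the marker is reversible, here is a hedged sketch of a normalized render and its inverse. It assumes the stored name is valid WTF-8, so the only ill-formed sequences are 3-byte surrogate encodings; `render_normalized` and `restore_units` are hypothetical helper names, not functions from the PR.

```rust
/// Decode the surrogate code unit from its 3-byte WTF-8 encoding, if present.
fn surrogate_unit(b: &[u8]) -> Option<u16> {
    match b {
        [0xED, b1 @ 0xA0..=0xBF, b2 @ 0x80..=0xBF, ..] => {
            Some(0xD000 | (u16::from(b1 & 0x3F) << 6) | u16::from(b2 & 0x3F))
        }
        _ => None,
    }
}

/// Replace each unpaired surrogate with a `<BAD:HHHH>` marker.
fn render_normalized(wtf8: &[u8]) -> String {
    let mut out = String::with_capacity(wtf8.len());
    let mut rest = wtf8;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(s) => {
                out.push_str(s);
                break;
            }
            Err(e) => {
                let (valid, bad) = rest.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                match surrogate_unit(bad) {
                    Some(unit) => {
                        out.push_str(&format!("<BAD:{unit:04X}>"));
                        rest = &bad[3..];
                    }
                    // Not expected in WTF-8 (assumption): fall back to U+FFFD.
                    None => {
                        out.push('\u{FFFD}');
                        rest = &bad[1..];
                    }
                }
            }
        }
    }
    out
}

/// Inverse direction: turn a normalized string back into UTF-16 code units.
fn restore_units(rendered: &str) -> Vec<u16> {
    let mut units = Vec::new();
    let mut rest = rendered;
    while let Some(start) = rest.find("<BAD:") {
        units.extend(rest[..start].encode_utf16());
        let tail = &rest[start + 5..];
        let hex = tail.get(..4).and_then(|h| u16::from_str_radix(h, 16).ok());
        match (hex, tail.as_bytes().get(4)) {
            // Exactly four hex digits followed by '>': emit the raw unit.
            (Some(unit), Some(b'>')) => {
                units.push(unit);
                rest = &tail[5..];
            }
            // Anything else is kept as literal text.
            _ => {
                units.extend("<BAD:".encode_utf16());
                rest = tail;
            }
        }
    }
    units.extend(rest.encode_utf16());
    units
}
```

Because `<` and `>` cannot appear in an NTFS name, every `<BAD:HHHH>` in a normalized path maps back to exactly one code unit, so `restore_units(&render_normalized(bytes))` recovers the original UTF-16 name.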

Performance

Zero hot-path cost throughout: valid names take Cow::Borrowed; only the rare
malformed name allocates, and only at render time. Filtering corrupt entries
stays on the existing --malformed filter.
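
The borrow-versus-allocate split can be checked directly on the `Cow` variant. The sketch below uses `String::from_utf8_lossy` as a stand-in for the real renderer; `name_display` here is a free function written for illustration, not the `CompactRecord` method.

```rust
use std::borrow::Cow;

/// Stand-in renderer: borrows valid UTF-8, allocates only for ill-formed input.
fn name_display(raw: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(raw)
}

/// Reports whether rendering `raw` had to allocate a new string.
fn allocates(raw: &[u8]) -> bool {
    matches!(name_display(raw), Cow::Owned(_))
}
```

`allocates(b"report.txt")` is false, while a name containing an unpaired surrogate such as `b"evil\xED\xA0\x80.exe"` takes the owned path.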

Tests

Crooked dir segment preserved (both resolvers agree, path flagged malformed);
crooked leaf still enumerated; lossy = evil�.exe, normalized =
evil<BAD:D800>.exe; end-to-end normalized path through the resolver; wire
round-trip + omitted-field backward-compat; CLI flag parsing. The WI-4.4 tests
moved to a compact_tests/malformed.rs submodule (800-LOC policy, no exception).

🤖 Generated with Claude Code

githubrobbi and others added 3 commits June 28, 2026 05:57
…ollapse)

A directory whose NTFS name is ill-formed (unpaired UTF-16 surrogate) is
stored byte-faithfully and flagged `malformed`, but the lossy `name()`
accessor returned "" for it. The path resolver pushed that empty segment,
so `…\evil�.exe\report.txt` collapsed to `…\report.txt` — re-parenting
children to the volume root and emitting duplicate parent rows. That is the
bulk of the G-drive parity mismatch vs the reference C++ tool, which renders
the lossy name.

Add `CompactRecord::name_display() -> Cow<str>`: valid names borrow at zero
cost; an ill-formed name renders lossily (U+FFFD, like C++) instead of
emptying. Repoint both path resolvers to push `name_display`, and fix
`resolve_path_inner` to terminate the parent walk on the lossless bytes (it
was breaking on the empty lossy name, truncating everything beneath a
crooked directory).

Hot path unaffected: well-formed names take `Cow::Borrowed`; only the rare
malformed case allocates. Filtering stays on the `malformed` flag.

Tests: a crooked directory segment is preserved in the resolved path (both
resolvers agree, path flagged malformed), and a crooked-leaf file is still
enumerated in search results with its lossy path — the two cases that
collapsed / went missing on the G drive.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

(The WI-4.4 malformed tests are extracted into a `compact_tests/malformed.rs`
submodule so `compact_tests.rs` stays under the 800-LOC policy.)
…Normalized)

`name_display`'s lossy path used `from_utf8_lossy`, which emits one U+FFFD per
BYTE — so a single ill-formed UTF-16 code unit (a 3-byte WTF-8 surrogate)
showed as `���` where the reference C++ tool, Everything, and file managers
show a single `�`. Render per CODE UNIT instead: walk the WTF-8, pass valid
runs through, and replace each offending unit with one marker. The default
(Lossy) now matches C++ exactly, which also lets the G-drive malformed rows
reconcile on their own.

Add `MalformedRender { Lossy, Normalized }` + `CompactRecord::name_display_with`.
Normalized replaces each bad code unit with a greppable, reversible
`<BAD:HHHH>` sentinel (HHHH = the code unit, e.g. `<BAD:DCFF>`). `<` / `>` are
invalid in NTFS names so the marker can never collide; the hex keeps two
malformed siblings distinct and the true name recoverable. This is the
rendering primitive the `--normalize-malformed` switch selects; the hot path is
unchanged (valid names borrow, only malformed allocate).

Tests: lossy → `evil�.exe` (one marker), normalized → `evil<BAD:D800>.exe`,
valid names byte-identical under both modes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thread the malformed-render mode end to end so corrupt (ill-formed UTF-16)
names can be surfaced as greppable, reversible `<BAD:HHHH>` markers instead of
the default `�` — for downstream tooling that needs to spot, parse, or
round-trip corrupt entries by their path string. (Filtering corrupt entries
stays on the existing malformed filter; this is display-only.)

* uffs-core: `SearchFilters.normalize_malformed` + `malformed_render()`; a
  `MalformedRender` param threaded through `resolve_path*` so the resolved path
  and the name column pick the render mode. Hot path unchanged — valid names
  still borrow, only malformed allocate.
* uffs-client: `SearchParams.normalize_malformed` (serde `#[serde(default)]`,
  backward-compatible wire) + the `--normalize-malformed` arm in
  `from_cli_args` (the parser uffs-cli delegates to).
* uffs-daemon: maps the request flag onto `SearchFilters` for the search
  output path; info / aggregate output stays on the default lossy render.
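
The threading described in the bullets above might be sketched as follows. The type and field names come from this commit message, but every signature, the `query` field, and the `From` conversion are assumptions made for illustration; the serde attribute is omitted so the sketch needs only the standard library.

```rust
/// Render mode selected by the switch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MalformedRender {
    Lossy,
    Normalized,
}

/// Client-side request; `Default` leaves the flag off, mirroring how an
/// omitted wire field would deserialize with `#[serde(default)]`.
#[derive(Debug, Default, PartialEq)]
struct SearchParams {
    query: String, // hypothetical field
    normalize_malformed: bool,
}

impl SearchParams {
    /// Hypothetical parser: recognizes the switch, treats the rest as the query.
    fn from_cli_args<'a>(args: impl IntoIterator<Item = &'a str>) -> Self {
        let mut params = SearchParams::default();
        for arg in args {
            match arg {
                "--normalize-malformed" => params.normalize_malformed = true,
                other => params.query = other.to_string(),
            }
        }
        params
    }
}

/// Core-side filters the daemon builds from the request.
#[derive(Debug, Default)]
struct SearchFilters {
    normalize_malformed: bool,
}

impl SearchFilters {
    /// Map the boolean flag onto the render mode handed to the resolver.
    fn malformed_render(&self) -> MalformedRender {
        if self.normalize_malformed {
            MalformedRender::Normalized
        } else {
            MalformedRender::Lossy
        }
    }
}

impl From<&SearchParams> for SearchFilters {
    fn from(params: &SearchParams) -> Self {
        SearchFilters {
            normalize_malformed: params.normalize_malformed,
        }
    }
}
```

A request built without the flag ends up as `MalformedRender::Lossy`, so older clients that never send the field keep the default output.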

Also drops the now-redundant `name` arg from `build_row_cached` (it is exactly
`rec.name(...)`, derivable from the `rec` it already receives).

Tests: end-to-end normalized path through the resolver
(`C:\evil<BAD:D800>.exe\report.txt`), wire round-trip + omitted-field
backward-compat, and CLI flag parsing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@githubrobbi githubrobbi enabled auto-merge June 28, 2026 15:53
@githubrobbi githubrobbi added this pull request to the merge queue Jun 28, 2026
Merged via the queue into main with commit 7b2e8bc Jun 28, 2026
21 checks passed
@githubrobbi githubrobbi deleted the fix/malformed-name-display branch June 28, 2026 16:08