fix(search): preserve malformed names (no collapse) + --normalize-malformed switch #488
Merged
…ollapse)

A directory whose NTFS name is ill-formed (unpaired UTF-16 surrogate) is stored byte-faithfully and flagged `malformed`, but the lossy `name()` accessor returned "" for it. The path resolver pushed that empty segment, so `…\evil�.exe\report.txt` collapsed to `…\report.txt`, re-parenting children to the volume root and emitting duplicate parent rows. That is the bulk of the G-drive parity mismatch vs the reference C++ tool, which renders the lossy name.

Add `CompactRecord::name_display() -> Cow<str>`: valid names borrow at zero cost; an ill-formed name renders lossily (U+FFFD, like C++) instead of emptying. Repoint both path resolvers to push `name_display`, and fix `resolve_path_inner` to terminate the parent walk on the lossless bytes (it was breaking on the empty lossy name, truncating everything beneath a crooked directory).

Hot path unaffected: well-formed names take `Cow::Borrowed`; only the rare malformed case allocates. Filtering stays on the `malformed` flag.

Tests: a crooked directory segment is preserved in the resolved path (both resolvers agree, path flagged malformed), and a crooked-leaf file is still enumerated in search results with its lossy path. These are the two cases that collapsed / went missing on the G drive.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

(The WI-4.4 malformed tests are extracted into a `compact_tests/malformed.rs` submodule so `compact_tests.rs` stays under the 800-LOC policy.)
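The collapse described in this commit can be reproduced in miniature. The sketch below is not the project's code: `Rec`, `resolve`, and `sample_tree` are hypothetical stand-ins, and dropping empty segments at join time is an assumed mechanism for how an empty name could make a directory vanish from the path. `name_display` here uses `String::from_utf8_lossy`, which borrows valid names and allocates only for ill-formed ones.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a compact record: WTF-8 name bytes plus a parent index.
struct Rec {
    name_bytes: Vec<u8>,
    parent: Option<usize>,
}

impl Rec {
    /// Models the old accessor: an ill-formed name comes back as "".
    fn name(&self) -> &str {
        std::str::from_utf8(&self.name_bytes).unwrap_or("")
    }

    /// Models the fix: valid names borrow, ill-formed names render lossily.
    fn name_display(&self) -> Cow<'_, str> {
        String::from_utf8_lossy(&self.name_bytes)
    }
}

/// Walks leaf -> root collecting segments, then joins them root -> leaf.
/// ASSUMPTION: empty segments are skipped when joining, which is what makes
/// the malformed directory disappear when `use_display` is false.
fn resolve(recs: &[Rec], leaf: usize, use_display: bool) -> String {
    let mut segments: Vec<Cow<'_, str>> = Vec::new();
    let mut cur = Some(leaf);
    while let Some(i) = cur {
        let rec = &recs[i];
        segments.push(if use_display {
            rec.name_display()
        } else {
            Cow::Borrowed(rec.name())
        });
        cur = rec.parent;
    }
    let mut path = String::new();
    for seg in segments.iter().rev() {
        if seg.is_empty() {
            continue;
        }
        if !path.is_empty() {
            path.push('\\');
        }
        path.push_str(seg);
    }
    path
}

/// C: -> evil<lone surrogate D800>.exe -> report.txt
fn sample_tree() -> Vec<Rec> {
    vec![
        Rec { name_bytes: b"C:".to_vec(), parent: None },
        Rec { name_bytes: b"evil\xED\xA0\x80.exe".to_vec(), parent: Some(0) },
        Rec { name_bytes: b"report.txt".to_vec(), parent: Some(1) },
    ]
}
```

With `sample_tree()`, `resolve(&recs, 2, false)` yields `C:\report.txt` (the directory is lost), while `resolve(&recs, 2, true)` keeps an `evil….exe` segment between the root and the leaf.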
…Normalized)
`name_display`'s lossy path used `from_utf8_lossy`, which emits one U+FFFD per
BYTE — so a single ill-formed UTF-16 code unit (a 3-byte WTF-8 surrogate)
showed as `���` where the reference C++ tool, Everything, and file managers
show a single `�`. Render per CODE UNIT instead: walk the WTF-8, pass valid
runs through, and replace each offending unit with one marker. The default
(Lossy) now matches C++ exactly, which also lets the G-drive malformed rows
reconcile on their own.
Add `MalformedRender { Lossy, Normalized }` + `CompactRecord::name_display_with`.
Normalized replaces each bad code unit with a greppable, reversible
`<BAD:HHHH>` sentinel (HHHH = the code unit, e.g. `<BAD:DCFF>`). `<` / `>` are
invalid in NTFS names so the marker can never collide; the hex keeps two
malformed siblings distinct and the true name recoverable. This is the
rendering primitive the `--normalize-malformed` switch selects; the hot path is
unchanged (valid names borrow, only malformed allocate).
Tests: lossy → `evil�.exe` (one marker), normalized → `evil<BAD:D800>.exe`,
valid names byte-identical under both modes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
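A minimal sketch of per-code-unit rendering, under stated assumptions: `render_wtf8` is a hypothetical free function (the commit's actual entry point is `CompactRecord::name_display_with`, whose signature is not shown here), and the handling of non-surrogate garbage bytes is a guess. It relies on the fact that WTF-8 encodes a lone surrogate as the three bytes `ED`, `A0..=BF`, `80..=BF`.

```rust
use std::borrow::Cow;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MalformedRender {
    Lossy,
    Normalized,
}

/// Renders WTF-8 bytes, emitting one marker per ill-formed UTF-16 code unit.
fn render_wtf8(bytes: &[u8], mode: MalformedRender) -> Cow<'_, str> {
    // Hot path: a well-formed name is borrowed, with no allocation.
    if let Ok(s) = std::str::from_utf8(bytes) {
        return Cow::Borrowed(s);
    }
    let mut out = String::with_capacity(bytes.len());
    let mut rest = bytes;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(s) => {
                out.push_str(s);
                break;
            }
            Err(e) => {
                // Pass the valid run through unchanged.
                let (valid, bad) = rest.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).expect("prefix was validated"));
                // A lone surrogate occupies exactly three WTF-8 bytes.
                let is_surrogate = bad.len() >= 3
                    && bad[0] == 0xED
                    && (0xA0..=0xBF).contains(&bad[1])
                    && (0x80..=0xBF).contains(&bad[2]);
                if is_surrogate {
                    // Decode the 3-byte sequence back to the UTF-16 code unit.
                    let cu = 0xD000u16
                        | (u16::from(bad[1] & 0x3F) << 6)
                        | u16::from(bad[2] & 0x3F);
                    match mode {
                        MalformedRender::Lossy => out.push('\u{FFFD}'),
                        MalformedRender::Normalized => {
                            out.push_str(&format!("<BAD:{cu:04X}>"));
                        }
                    }
                    rest = &bad[3..];
                } else {
                    // ASSUMPTION: any other invalid byte becomes one U+FFFD
                    // in both modes; the commit does not cover this case.
                    out.push('\u{FFFD}');
                    rest = &bad[1..];
                }
            }
        }
    }
    Cow::Owned(out)
}
```

For the name bytes `b"evil\xED\xA0\x80.exe"`, `Lossy` yields `evil�.exe` and `Normalized` yields `evil<BAD:D800>.exe`, matching the test expectations listed above.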
Thread the malformed-render mode end to end so corrupt (ill-formed UTF-16) names can be surfaced as greppable, reversible `<BAD:HHHH>` markers instead of the default `�`, for downstream tooling that needs to spot, parse, or round-trip corrupt entries by their path string. (Filtering corrupt entries stays on the existing malformed filter; this is display-only.)

* uffs-core: `SearchFilters.normalize_malformed` + `malformed_render()`; a `MalformedRender` param threaded through `resolve_path*` so the resolved path and the name column pick the render mode. Hot path unchanged: valid names still borrow, only malformed allocate.
* uffs-client: `SearchParams.normalize_malformed` (serde `#[serde(default)]`, backward-compatible wire) + the `--normalize-malformed` arm in `from_cli_args` (the parser uffs-cli delegates to).
* uffs-daemon: maps the request flag onto `SearchFilters` for the search output path; info / aggregate output stays on the default lossy render.

Also drops the now-redundant `name` arg from `build_row_cached` (it is exactly `rec.name(...)`, derivable from the `rec` it already receives).

Tests: end-to-end normalized path through the resolver (`C:\evil<BAD:D800>.exe\report.txt`), wire round-trip + omitted-field backward-compat, and CLI flag parsing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
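To illustrate what "reversible" means for downstream tooling, here is a hedged sketch of a decoder that turns a normalized path string back into UTF-16 code units. `denormalize_to_utf16` is a hypothetical helper, not part of this PR; it assumes every `<BAD:HHHH>` in the input is a marker, which holds for NTFS names because `<` and `>` cannot appear in them.

```rust
/// Rebuilds the original UTF-16 code units from a string rendered in
/// normalized mode. Each `<BAD:HHHH>` marker becomes the raw code unit HHHH;
/// every other character is re-encoded as ordinary UTF-16.
fn denormalize_to_utf16(rendered: &str) -> Vec<u16> {
    const PREFIX: &str = "<BAD:";
    const MARKER_LEN: usize = 10; // "<BAD:" + 4 hex digits + ">"

    let mut units = Vec::with_capacity(rendered.len());
    let mut rest = rendered;
    while !rest.is_empty() {
        // Try to consume one marker at the current position.
        if let Some(hex) = rest.strip_prefix(PREFIX).and_then(|t| t.get(..4)) {
            let closes = rest.as_bytes().get(MARKER_LEN - 1) == Some(&b'>');
            if closes && hex.bytes().all(|b| b.is_ascii_hexdigit()) {
                if let Ok(cu) = u16::from_str_radix(hex, 16) {
                    units.push(cu);
                    rest = &rest[MARKER_LEN..];
                    continue;
                }
            }
        }
        // Otherwise copy one character through as UTF-16.
        let ch = rest.chars().next().expect("rest is non-empty");
        let mut buf = [0u16; 2];
        units.extend_from_slice(ch.encode_utf16(&mut buf));
        rest = &rest[ch.len_utf8()..];
    }
    units
}
```

Feeding it `evil<BAD:D800>.exe` returns the code units of `evil`, then the lone surrogate `0xD800`, then the code units of `.exe`, which is exactly the ill-formed name that `String::from_utf16` would reject.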
Preserve malformed-named entries + `--normalize-malformed` switch

Fixes the G-drive parity mismatch and makes corrupt (ill-formed UTF-16) NTFS names first-class in output. Three reviewable commits:
1. Don't collapse malformed paths (the bug)

A directory whose name is an unpaired UTF-16 surrogate was stored byte-faithfully and flagged `malformed`, but the lossy `name()` accessor returned `""` for it, so the path resolver pushed an empty segment: `…\evil�.exe\report.txt` collapsed to `…\report.txt`, re-parenting children to the volume root and emitting duplicate parent rows. New `CompactRecord::name_display()` renders ill-formed names lossily and both resolvers preserve the segment (and `resolve_path_inner` now terminates the parent walk on the lossless bytes so a crooked dir can't truncate the path beneath it).
2. One marker per code unit (match C++)

`from_utf8_lossy` emits one `�` per byte (`���` for a single bad code unit) where the reference C++ tool, Everything, and file managers show one. The new renderer walks the WTF-8 and replaces each offending code unit with one marker, so the default output now matches C++ exactly (which is what lets the G-drive malformed rows reconcile on their own).
3. `--normalize-malformed` (the switch)

Opt into greppable, reversible `<BAD:HHHH>` markers (HHHH = the bad code unit, e.g. `<BAD:DCFF>`) instead of `�`, for downstream tooling that needs to spot, parse, or round-trip corrupt entries by path string. `<` / `>` are invalid in NTFS names so the marker can't collide; the hex keeps siblings distinct and is reversible. Threaded end to end: CLI flag → `SearchParams` (backward-compatible wire) → daemon → `SearchFilters` → a `MalformedRender` param on `resolve_path*`.

Performance
Zero hot-path cost throughout: valid names take `Cow::Borrowed`; only the rare malformed name allocates, and only at render time. Filtering corrupt entries stays on the existing `--malformed` filter.

Tests
Crooked dir segment preserved (both resolvers agree, path flagged malformed); crooked leaf still enumerated; lossy = `evil�.exe`, normalized = `evil<BAD:D800>.exe`; end-to-end normalized path through the resolver; wire round-trip + omitted-field backward-compat; CLI flag parsing. The WI-4.4 tests moved to a `compact_tests/malformed.rs` submodule (800-LOC policy, no exception).

🤖 Generated with Claude Code