Claude/identify nanobot changes qpm8 z#28
Closed
chancsc wants to merge 8 commits into
Merged changes from head repo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nomic-embed-text produces cosine ~0.75 for same-domain different-fact pairs (e.g. two butterfly survey records at different locations). The old threshold of 0.70 let cosine override token similarity, incorrectly classifying distinct insights as UPDATE and replacing the original. Raising to 0.85 ensures cosine only confirms deduplication when texts are genuinely near-identical.
Adds regression test with controlled 0.75-cosine fake embeddings.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ContentSimilarity (bidirectional max) was too sensitive for formulaic scientific records: a Raub butterfly entry sharing the species name and standard phrasing with a Kinabalu entry produced tokenSim=0.5, crossing the UPDATE threshold and replacing the original.
Jaccard (|A∩B|/|A∪B|) penalises texts that share domain vocabulary but have many distinct tokens (different facts). Same-domain different-location pairs now score ~0.28, falling below the 0.5 ADD threshold. Genuine one-word-change updates (SQLite→PostgreSQL) still score ~0.6 → UPDATE.
ContentSimilarity is unchanged: bidirectional max remains correct for recall and keyword search.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
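The Jaccard measure named in this commit can be sketched as follows. This is an illustration under assumptions: the whitespace, lower-casing tokenizer and the names `tokenSet` and `Jaccard` are hypothetical, and the project's real tokenization may differ, so the scores quoted in the commit message will not necessarily be reproduced by it.

```go
package main

import "strings"

// tokenSet lower-cases s and splits it on whitespace into a set of
// distinct tokens. The tokenizer is a simplifying assumption.
func tokenSet(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, tok := range strings.Fields(strings.ToLower(s)) {
		set[tok] = struct{}{}
	}
	return set
}

// Jaccard returns |A∩B| / |A∪B| over the token sets of a and b.
// Two empty texts score 0 rather than dividing by zero.
func Jaccard(a, b string) float64 {
	setA, setB := tokenSet(a), tokenSet(b)
	if len(setA) == 0 && len(setB) == 0 {
		return 0
	}
	intersection := 0
	for tok := range setA {
		if _, ok := setB[tok]; ok {
			intersection++
		}
	}
	union := len(setA) + len(setB) - intersection
	return float64(intersection) / float64(union)
}
```

Because the union sits in the denominator, every token that appears in only one text lowers the score. That is why two records sharing vocabulary but stating different facts score lower than they would under a bidirectional max.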
…ity>=0.7
Two bugs caused CONFLICT false positives on butterfly survey data:
1. "not" in negationWords fires on virtually all scientific text
("species not previously recorded", "not endemic to region").
Removed: only multi-word state-change phrases remain as signals.
2. Negation check fired at similarity>=0.5. At borderline similarity,
texts share domain vocabulary without being about the same subject.
Now only checked when similarity>=0.7.
Also updates guide.md: PDF/external-document facts must use --no-diff
since each document is a distinct authoritative source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
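The two fixes above can be sketched together as a gated check. This is a hedged illustration, not the project's implementation: the phrase list, the function names, and the rule that only the new text is scanned are all assumptions. Only the removal of the bare word "not" and the 0.7 gate come from the commit message.

```go
package main

import "strings"

// negationMinSimilarity is the 0.7 gate from the commit message
// (previously 0.5). The constant name is an assumption.
const negationMinSimilarity = 0.7

// stateChangePhrases is a hypothetical list of multi-word signals.
// The bare word "not" is deliberately absent, as described above.
var stateChangePhrases = []string{"no longer", "instead of", "switched from"}

// HasStateChangePhrase reports whether text contains any of the
// multi-word state-change phrases, ignoring case.
func HasStateChangePhrase(text string) bool {
	lower := strings.ToLower(text)
	for _, phrase := range stateChangePhrases {
		if strings.Contains(lower, phrase) {
			return true
		}
	}
	return false
}

// IsConflictCandidate runs the phrase check only when the two texts
// are similar enough to plausibly be about the same subject.
func IsConflictCandidate(similarity float64, newText string) bool {
	if similarity < negationMinSimilarity {
		return false
	}
	return HasStateChangePhrase(newText)
}
```

Under these assumptions, a sentence such as "species not previously recorded" never triggers the check at any similarity, and a genuine state-change phrase is ignored when the similarity is only borderline.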
Upstream changes included:
- Nanobot integration officially merged (PR mnemon-dev#24), our contribution
- Codex integration added (PR mnemon-dev#27)
- v0.1.5 dedup fixes merged (PR mnemon-dev#25), our contribution
- v0.1.6 release notes
Conflict resolution:
- cmd/setup.go, assets/assets.go: took upstream (adds Codex alongside Nanobot)
- SKILL.md: took upstream (our reviewed version with softened guardrail)
- README.md: kept upstream harness wording and Vision paragraph
sync code