feat(middleware): Use NER models instead of Regex for PII filtering#10160
Draft
richiejp wants to merge 10 commits into
Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's n_batch which is now scaled to context so long probes can't crash the backend. Missing chat_message templates are a hard error at router build time.
Router-facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call so a model installed post-startup doesn't bind a stub Backend="" config and silently fall into the loader's auto-iterate path.
New 'vector_store' backend trace recorded inside localVectorStore on every Search/Insert (including the backend-load-failure path that previously vanished into an xlog.Warn), with outcome tagging (hit/miss/empty_store/backend_load_error/find_error/insert_error/ok). Companion cleanup drops misleading similarity:0 and input_tokens_count:0 from non-hit and text-mode traces.
Gallery local-store-development aliases to 'local-store' so the master image satisfies pkg/model.LocalStoreBackend lookups from the embedding cache.
Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key (the original bug); ModelTokenize nil-guard; non-fatal mitm proxy startup; PII 'route_local' renamed to 'allow' with docs/UI in sync; model-editor footer no longer eats the edit area on small screens; several config-editor template/dropdown/section fixes.
Tests: e2e router specs (casual/code-hint + long-conversation trim), vector_store trace specs, lazy-factory specs, gallery dev-alias resolution, Playwright trace badge + scroll regression.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
…dels Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins. Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse. Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com>
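The batch-sizing rule described in this commit can be sketched as follows. This is a minimal illustration, not the Go code from the PR: the function name echoes EffectiveBatchSize from the message, but the signature, the `single_pass` flag, and the `DEFAULT_BATCH` constant are assumptions made for the example.

```python
from typing import Optional

# Assumed default physical batch, taken from the "512 default" in the message.
DEFAULT_BATCH = 512


def effective_batch_size(context_size: int,
                         explicit_batch: Optional[int] = None,
                         single_pass: bool = False) -> int:
    """Pick the batch size for a model (hypothetical helper).

    - An explicit `batch:` setting always wins.
    - Single-pass usecases (embedding, rerank, score) pool over the whole
      input at once, so the batch is sized to the context.
    - Everything else keeps the default.
    """
    if explicit_batch:
        return explicit_batch
    if single_pass:
        return context_size
    return DEFAULT_BATCH
```

Under these assumptions a 2048-context embedder gets a 2048-token batch, while a chat model with the same context keeps 512 unless its config sets a batch explicitly.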
Layers an optional token-classification (NER) tier on top of the regex PII redactor for the inbound chat/messages middleware. When a model's pii.ner.model is set, RequestMiddleware (via the new pii.WithNERResolver option) resolves a detector over the shared model loader and runs RedactWithNER (regex + NER merged); without it, redaction stays regex-only, so existing four-arg call sites and regex-only stubs are unaffected.
- core/config: PIINERConfig (model, min_score, default_action, entity_actions) under PIIConfig; default action is mask (safe-by-default for a PII filter). Registry entries + grandfather for the map field so the field-registry coverage test stays green.
- core/application: PIINERResolver binds a token-classifier (piidetector) over the model loader, lazily: the model loads on first Detect; unknown/unconfigured names resolve to nil.
- core/services/routing/pii: the middleware fails CLOSED on the NER tier: if a model has pii.ner.model set but the tier cannot run (detector errors at request time, or the model can't be resolved to a detector), the request is blocked with 503 pii_ner_unavailable and a fail-closed audit event, rather than silently downgrading to regex-only. The redactor's RedactWithNER stays fail-open (returns a best-effort regex result + error); the block policy lives in the middleware.
- core/services/routing/piidetector: detector backing the NER tier.
- core/backend: TokenClassify backend call (gRPC TokenClassifyRequest/Response) + tests.
- backend/python/transformers: TokenClassify now emits UTF-8 byte offsets (proto contract) instead of HF codepoint offsets, and returns the exact text slice. Fixes wrong spans on multibyte/multilingual input.
llama.cpp privacy-filter arch (Phase 1, carry-patches under backend/cpp/llama-cpp/patches, applied by prepare.sh; all five apply in order and compile against the current pin 5dcb71166):
- 0001 TOKEN_CLS pooling substrate (reduced subset of upstream #19725).
- 0002 registers the openai-privacy-filter architecture + gguf-py arch/tensor mappings (score -> cls.output).
- 0003 HF->GGUF converter (OpenAIPrivacyFilterModel, a GptOssModel subclass); validated end-to-end against OpenMed/privacy-filter-multilingual (157-tensor F16 GGUF, metadata verified). Splits the expert gate_up as concatenated halves (not gpt-oss's interleaved ::2/1::2) and writes per-dim rope_freqs.weight carrying HF's exact YaRN inv_freq (truncate=false), since ggml's shared YaRN ramp floor/ceils the correction band.
- 0004 model graph + loader wiring (llama_model_openai_privacy_filter): gpt-oss MoE body as a bidirectional token classifier: no KV cache, uniform symmetric sliding-window band, attention sinks, no LM head; ends at the per-token hidden states so the framework's TOKEN_CLS pooling applies the cls.output head per token. Uses the interleaved (GPT-J) LLAMA_ROPE_TYPE_NORM layout (unlike gpt-oss's NEOX) and feeds the per-layer rope_freqs into ggml_rope_ext with ggml's YaRN ramp disabled (mscale kept via rope_attn_factor).
- 0005 no-cache all-SWA mask fix (llama-graph.cpp): an encoder whose every layer is SWA leaves the full (non-windowed) attention mask unallocated; set_input now only fills a mask that actually got a buffer, else the model aborts on the first decode.
Status: parity solved. The new arch matches the HF reference token-for-token against OpenMed/privacy-filter-multilingual at F16: 12/12 argmax, full-logit cosine = 1.0, every layer's residual stream cos = 1.0 (relerr ~2e-4 = F16 rounding), including the e-mail BIOES span. Verified on the real llama-embedding binary with model-default TOKEN_CLS pooling. Root cause of the earlier attenuation was two independent RoPE bugs (NEOX vs interleaved/NORM dim-pairing, dominant; plus ggml's YaRN truncate rounding), both fixed in 0003/0004. The two parity-gated assumptions (n_swa = 2*sliding_window and the gate_up packing) are confirmed correct.
Plans/integration notes under docs/plans/pii-ner-ggml.
Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
The devShell shipped `protobuf` (for Go proto generation) but no C++ gRPC, so `make grpc-server` in backend/cpp/llama-cpp could not locate gRPC via find_package(gRPC) and fell back to a stale, version-skewed grpc from the store (protobuf 34.1 headers vs a grpc built against 32.1), aborting on a protobuf gencode mismatch. nixpkgs builds `grpc` against the same `protobuf`, so adding it gives a self-consistent C++ stack. Docker (backend/Dockerfile.base-grpc-builder) compiles gRPC v1.65.0 / protoc v27.1 from source; the nixpkgs pair here (grpc 1.80 / protobuf 34) is newer but wire/ABI-consistent. Verified a clean cmake build of grpc-server inside `nix develop` with no manual flag overrides. Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Implements the TokenClassify gRPC primitive in the vendored llama.cpp
backend, completing Phase 2 of the PII NER tier. Mirrors Score's
direct-decode strategy (bypassing the slot/task queue under the same
conflict_guard + mutex) because it needs full control over batch output
flags, per-token logit readout, and overlapping-window stitching.
Pipeline: tokenize (o200k) with UTF-8 byte offsets -> windowed
non-causal forward -> per-token n_cls_out logits via
llama_get_embeddings_ith -> fp32 log_softmax -> constrained linear-chain
BIOES Viterbi (the model's transition biases are 0.0, so structural
constraints only) -> span assembly -> whitespace-trimmed byte spans ->
TokenClassifyEntity{entity_group, start, end, score, text}.
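The decoding step in the pipeline above (log-softmax scores followed by a structurally constrained BIOES Viterbi) can be illustrated with a small sketch. It is written in Python for readability and is not the C++ from the backend: the function names are hypothetical, and transitions contribute no score of their own (matching the 0.0 transition biases), so they act only as allow/deny constraints.

```python
import math

NEG_INF = -math.inf


def allowed(prev: str, cur: str) -> bool:
    """Structural BIOES constraint: may label `cur` follow label `prev`?"""
    p_tag, _, p_type = prev.partition("-")
    c_tag, _, c_type = cur.partition("-")
    if p_tag in ("B", "I"):
        # Inside an open span: must continue or end a span of the same type.
        return c_tag in ("I", "E") and c_type == p_type
    # After O, E or S: stay outside or open a new span.
    return c_tag in ("O", "B", "S")


def viterbi_bioes(log_probs, labels):
    """Best valid label path; log_probs[t][j] is the score of labels[j] at token t.

    Assumes `labels` contains "O", so at least one valid path always exists.
    """
    n = len(labels)
    tags = [lab.partition("-")[0] for lab in labels]
    # A sequence cannot start in the middle or at the end of a span.
    score = [lp if tags[j] in ("O", "B", "S") else NEG_INF
             for j, lp in enumerate(log_probs[0])]
    back = []
    for row in log_probs[1:]:
        new_score, ptr = [], []
        for j in range(n):
            best_i, best = -1, NEG_INF
            for i in range(n):
                if score[i] > best and allowed(labels[i], labels[j]):
                    best_i, best = i, score[i]
            new_score.append(best + row[j] if best_i >= 0 else NEG_INF)
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    # A sequence cannot end with a span still open.
    final = [s if tags[j] in ("O", "E", "S") else NEG_INF
             for j, s in enumerate(score)]
    j = max(range(n), key=lambda k: final[k])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [labels[k] for k in path]
```

The constraint matters when the per-token argmax is not a valid sequence: a greedy B-EMAIL, O, E-EMAIL reading is replaced by the best path that is structurally legal, for example B-EMAIL, I-EMAIL, E-EMAIL.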
Windowing uses a halo of n_layer*sliding_window: a symmetric +/-128 band
per layer compounds across the 8 layers, so a token's logits depend on
+/-1024 neighbours, not +/-128 (short inputs stay a single exact
forward). Requires a TOKEN_CLS-pooling model loaded with embeddings
enabled.
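The halo arithmetic works out to 8 layers times a 128-token band, i.e. 1024 tokens of context on each side. A rough sketch of how overlapping windows could be planned under that rule is shown below; the function, its return shape, and the `max_window` parameter are illustrative assumptions, not the backend's actual stitching code.

```python
def plan_windows(n_tokens: int, max_window: int,
                 n_layer: int = 8, sliding_window: int = 128):
    """Plan forward passes as (ctx_start, ctx_end, keep_start, keep_end).

    Each pass decodes tokens [ctx_start, ctx_end) but only the logits for
    [keep_start, keep_end) are kept, so every kept token has a full halo
    of context on both sides (or reaches the edge of the input).
    """
    halo = n_layer * sliding_window  # 8 * 128 = 1024 with the defaults
    if n_tokens <= max_window:
        # Short inputs stay a single exact forward pass.
        return [(0, n_tokens, 0, n_tokens)]
    core = max_window - 2 * halo
    if core <= 0:
        raise ValueError("max_window must exceed twice the halo")
    plans = []
    keep_start = 0
    while keep_start < n_tokens:
        keep_end = min(keep_start + core, n_tokens)
        ctx_start = max(0, keep_start - halo)
        ctx_end = min(n_tokens, keep_end + halo)
        plans.append((ctx_start, ctx_end, keep_start, keep_end))
        keep_start = keep_end
    return plans
```

With a hypothetical 4096-token window, a 5000-token input is covered by three passes whose kept ranges tile the input without gaps, while each pass stays within the window.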
Validated end-to-end against OpenMed/privacy-filter-multilingual at F16:
correct entities across English/German with byte-exact multibyte offsets
(the 2-byte U+00FC in "Müller" is spanned correctly).
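To make the byte-offset point concrete, here is a small sketch of converting a codepoint span (as Hugging Face tokenizers report it) into the UTF-8 byte span the proto contract expects. The helper name is hypothetical; the conversion itself relies only on standard string encoding.

```python
def char_span_to_byte_span(text: str, start: int, end: int):
    """Convert a [start, end) codepoint span into a UTF-8 byte span."""
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = byte_start + len(text[start:end].encode("utf-8"))
    return byte_start, byte_end


text = "Herr Müller wohnt in Köln"
raw = text.encode("utf-8")

# "Müller" is 6 codepoints but 7 bytes, because U+00FC takes 2 bytes.
b_start, b_end = char_span_to_byte_span(text, 5, 11)
assert raw[b_start:b_end].decode("utf-8") == "Müller"
```

Every multibyte character before a span shifts its byte offsets, so codepoint offsets passed through unchanged drift further from the true span the longer the multilingual input gets.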
Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Adds FLAG_TOKEN_CLASSIFY, mirroring FLAG_SCORE's explicit-opt-in pattern for an internal direct-decode RPC: it's declared via `known_usecases: [token_classify]`, is authoritative (HasUsecases won't paint chat/embeddings on top via the heuristic), and has no guessing heuristic. On llama-cpp Validate() rejects combining it with chat/completion (TokenClassify bypasses the slot loop and races generation) but allows embeddings, which TOKEN_CLS pooling requires. Like FLAG_SCORE this is intentionally not registered in backend_capabilities.go's UsecaseInfoMap: it has no public REST route (the PII redactor's NER tier calls TokenClassify directly), so it stays a known_usecases flag only. Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Gallery entry for the OpenMed/privacy-filter-multilingual PII NER model, converted to GGUF (F16) for the vendored llama.cpp backend. Sets backend: llama-cpp, embeddings: true, and known_usecases: [token_classify] so it loads under TOKEN_CLS pooling and is consumed via a model config's pii.ner.model seam (not a standalone chat/completion model). The uri points at localai-org/privacy-filter-multilingual-GGUF; the sha256 is the F16 artifact's real hash. The model runs only on a llama.cpp build carrying the openai-privacy-filter carry-patches in backend/cpp/llama-cpp/patches/ (the arch is not yet upstream). Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Invert the PII filtering model so detection policy lives on the NER
(token-classification) detector model itself, and consuming models just
reference detectors by name.
Schema (core/config/model_config.go):
- New top-level `pii_detection:` block on a detector model
(min_score, default_action, entity_actions) + accessors.
- Consumer `pii:` is now `{ enabled, detectors: [<model>...] }`.
- `pii.patterns` / `pii.ner` kept only as untyped deprecated shadows so
old YAMLs still parse; Validate() warns (does not fail).
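As a rough illustration of how a detector's pii_detection policy might be applied to a single hit, consider the sketch below. The three field names come from the schema above; the dict representation, the helper name, the example threshold, exact-match lookup of entity groups, and dropping hits below min_score are all assumptions.

```python
from typing import Optional


def resolve_action(policy: dict, entity_group: str, score: float) -> Optional[str]:
    """Return "mask", "block" or "allow" for a hit, or None if it is ignored.

    Assumption: hits scoring below min_score are discarded, and groups
    without an entry in entity_actions fall back to default_action.
    """
    if score < policy.get("min_score", 0.0):
        return None
    actions = policy.get("entity_actions", {})
    return actions.get(entity_group, policy.get("default_action", "mask"))


# Illustrative policy: mask by default, block a few high-risk groups.
example_policy = {
    "min_score": 0.5,  # hypothetical threshold
    "default_action": "mask",
    "entity_actions": {"PASSWORD": "block", "SSN": "block"},
}
```

Keeping the policy on the detector model means every consumer that lists the detector under pii.detectors gets the same treatment for a given entity group.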
Middleware (core/services/routing/pii):
- Redactor is now a stateless handle; RedactNER(ctx, text, []NERConfig)
runs every detector, unions hits, and overlap-merges (block>mask>allow).
- NERDetectorResolver returns (NERConfig, bool); the resolver reads each
detector model's pii_detection policy (NERConfigFromRaw).
- RequestMiddleware is NER-only, multi-detector, fail-closed on a
detector that can't be resolved or errors.
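The union and overlap-merge step with block > mask > allow precedence could look roughly like this. It is a Python sketch of the idea rather than the Go implementation: the `Hit` type and function name are hypothetical, and letting the strongest action also decide the merged span's entity group is an assumption.

```python
from dataclasses import dataclass

# Higher number wins when overlapping hits disagree.
PRECEDENCE = {"allow": 0, "mask": 1, "block": 2}


@dataclass
class Hit:
    start: int   # byte offset, inclusive
    end: int     # byte offset, exclusive
    group: str   # entity group reported by the detector
    action: str  # "allow", "mask" or "block"


def merge_hits(hits):
    """Union hits from all detectors and merge overlapping spans."""
    merged = []
    for h in sorted(hits, key=lambda x: (x.start, x.end)):
        if merged and h.start < merged[-1].end:
            # Overlap: widen the previous span and keep the stronger action.
            last = merged[-1]
            last.end = max(last.end, h.end)
            if PRECEDENCE[h.action] > PRECEDENCE[last.action]:
                last.action, last.group = h.action, h.group
        else:
            # Copy so the caller's hits are never mutated.
            merged.append(Hit(h.start, h.end, h.group, h.action))
    return merged
```

With this ordering, a span that one detector would merely mask is still blocked if any other detector flags an overlapping region as block, which matches the fail-closed stance of the middleware.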
Regex tier fully removed: patterns.go, config.go (LoadConfig/--pii-config),
the response-side StreamFilter, the /api/pii/{patterns,test,decide,persist}
admin routes, the MCP list/test/set/persist pattern tools, and the dead
--pii-config/--disable-pii AppOptions + runtime_settings overrides. Output/
streaming redaction is dropped for now (NER is request-side only).
Cloud-proxy/MITM now runs NER on the request input (mitm/handler.go gets a
per-host []NERConfig resolved at listener start), fail-closed; the response
is forwarded unmodified.
Capability metadata: pii.detectors (model-multi-select filtered to
token_classify) + pii_detection.* registry entries; config-metadata
autocomplete gains a token_classify case; API instructions rewritten.
UI (React) is a follow-up; this is the Go side.
Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
- gallery/index.yaml: privacy-filter-multilingual ships a default
pii_detection policy (validated against the GGUF's real label set —
mask everything; block PASSWORD/PIN/CVV/CREDITCARD/IBAN/BIC/BANKACCOUNT/
SSN/{BITCOIN,ETHEREUM,LITECOIN}ADDRESS).
- docs/advanced/model-configuration.md: new "PII filtering" section
(pii_detection on detector models + pii.detectors on consumers).
- docs/features/middleware.md: rewrote the PII section for the NER-only
model; dropped the removed regex pattern catalogue / endpoints / MCP
pattern tools / streaming filter.
Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
React UI side of the PII redesign.
- New EntityActionListEditor (entity group → mask|block|allow map editor for a detector model's pii_detection.entity_actions, with a datalist of common categories) and ModelMultiSelect (capability-filtered detector picker for a consumer's pii.detectors).
- ConfigFieldRenderer: dispatch `entity-action-list` + `model-multi-select`; map `models:token_classify` → FLAG_TOKEN_CLASSIFY; drop `pii-pattern-list`.
- capabilities.js: CAP_TOKEN_CLASSIFY.
- Middleware page: remove the pattern catalogue, the per-pattern action editor, and the "Save to disk" persist flow; the per-model table now shows the NER detectors each config references. Removed the dead pattern-mutation state/handlers.
- modelTemplates: MITM template seeds pii.detectors instead of pii.patterns.
- Deleted PIIPatternListEditor.
- e2e/middleware-page.spec: fixture + tests updated for the detector model; removed the PUT /api/pii/patterns test.
Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Description
This builds on top of the request routing fixes PR.
It replaces the regex patterns for PII with a NER model. This allows information such as names, birth dates, and addresses to be redacted or blocked, which is impossible with regex. Potentially we can still have regex (or some other non-neural detector) via a backend as well.
We lose the PII filtering on responses, however, because running NER on a streamed response is more difficult than running regex. I think we could add it back if needed.
Notes for Reviewers
Signed commits