feat(middleware): Use NER models instead of Regex for PII filtering by richiejp · Pull Request #10160 · mudler/LocalAI

richiejp · 2026-06-03T11:06:38Z

Description

This builds on top of the request routing fixes PR.

It replaces the regex patterns for PII with a NER model. This allows information such as names, birth dates, addresses and so on to be redacted or blocked, which is impossible with regex. Potentially we could still offer regex (or some other non-neural detector) via a backend as well.

We lose the PII filtering on responses, however, because doing NER on a streamed response is more difficult than regex, but I think we could add it back in if needed.

Notes for Reviewers

feat(pii): inbound encoder/NER detection tier
build(nix): add C++ gRPC to the dev shell for the llama-cpp backend
feat(llama-cpp): TokenClassify RPC for openai-privacy-filter NER
feat(config): add token_classify known_usecase for the PII NER tier
feat(gallery): add privacy-filter-multilingual token-classify model
refactor(pii): NER-centric PII filter; remove the regex tier
docs(pii): gallery pii_detection policy + NER-centric docs
feat(ui): NER-centric PII editor; drop the regex pattern UI

Signed commits

Yes, I signed my commits.

Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's n_batch which is now scaled to context so long probes can't crash the backend. Missing chat_message templates are a hard error at router build time.

Router-facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call so a model installed post-startup doesn't bind a stub Backend="" config and silently fall into the loader's auto-iterate path.

New 'vector_store' backend trace recorded inside localVectorStore on every Search/Insert (including the backend-load-failure path that previously vanished into an xlog.Warn), with outcome tagging (hit/miss/empty_store/backend_load_error/find_error/insert_error/ok). Companion cleanup drops misleading similarity:0 and input_tokens_count:0 from non-hit and text-mode traces.

Gallery local-store-development aliases to 'local-store' so the master image satisfies pkg/model.LocalStoreBackend lookups from the embedding cache.

Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key (the original bug); ModelTokenize nil-guard; non-fatal mitm proxy startup; PII 'route_local' renamed to 'allow' with docs/UI in sync; model-editor footer no longer eats the edit area on small screens; several config-editor template/dropdown/section fixes.

Tests: e2e router specs (casual/code-hint + long-conversation trim), vector_store trace specs, lazy-factory specs, gallery dev-alias resolution, Playwright trace badge + scroll regression.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

…dels

Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins.

Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse.

Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
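The batch-sizing rule described in this commit can be illustrated with a minimal Python sketch. It is not the Go code extracted from grpcModelOpts: the function name, the argument names and the choice to treat a missing or non-positive batch as "unset" are assumptions made for illustration only.

```python
from typing import Optional

# The commit message names 512 as the default batch size.
DEFAULT_BATCH = 512


def effective_batch_size(
    context_size: int,
    explicit_batch: Optional[int],
    single_pass: bool,
) -> int:
    """Pick a batch size for a model (hypothetical helper, not LocalAI's API).

    - An explicit `batch:` setting always wins.
    - Single-pass use cases (embedding, rerank, score) must fit the whole
      input in one physical batch, so the batch is sized to the context.
    - Everything else keeps the default.
    """
    if explicit_batch is not None and explicit_batch > 0:
        return explicit_batch
    if single_pass:
        return context_size
    return DEFAULT_BATCH
```

Under these assumptions an 8k-context embedder with no explicit batch gets a batch of 8192 rather than 512, which is the capping problem the commit describes.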

Layers an optional token-classification (NER) tier on top of the regex PII redactor for the inbound chat/messages middleware. When a model's pii.ner.model is set, RequestMiddleware (via the new pii.WithNERResolver option) resolves a detector over the shared model loader and runs RedactWithNER (regex + NER merged); without it, redaction stays regex-only, so existing four-arg call sites and regex-only stubs are unaffected.

- core/config: PIINERConfig (model, min_score, default_action, entity_actions) under PIIConfig; default action is mask (safe-by-default for a PII filter). Registry entries + grandfather for the map field so the field-registry coverage test stays green.
- core/application: PIINERResolver binds a token-classifier (piidetector) over the model loader, lazily: the model loads on first Detect; unknown/unconfigured names resolve to nil.
- core/services/routing/pii: the middleware fails CLOSED on the NER tier. If a model has pii.ner.model set but the tier cannot run (detector errors at request time, or the model can't be resolved to a detector), the request is blocked with 503 pii_ner_unavailable and a fail-closed audit event, rather than silently downgrading to regex-only. The redactor's RedactWithNER stays fail-open (returns a best-effort regex result + error); the block policy lives in the middleware.
- core/services/routing/piidetector: detector backing the NER tier.
- core/backend: TokenClassify backend call (gRPC TokenClassifyRequest/Response) + tests.
- backend/python/transformers: TokenClassify now emits UTF-8 byte offsets (proto contract) instead of HF codepoint offsets, and returns the exact text slice. Fixes wrong spans on multibyte/multilingual input.

llama.cpp privacy-filter arch (Phase 1, carry-patches under backend/cpp/llama-cpp/patches, applied by prepare.sh; all five apply in order and compile against the current pin 5dcb71166):

- 0001 TOKEN_CLS pooling substrate (reduced subset of upstream #19725).
- 0002 registers the openai-privacy-filter architecture + gguf-py arch/tensor mappings (score -> cls.output).
- 0003 HF->GGUF converter (OpenAIPrivacyFilterModel, a GptOssModel subclass); validated end-to-end against OpenMed/privacy-filter-multilingual (157-tensor F16 GGUF, metadata verified). Splits the expert gate_up as concatenated halves (not gpt-oss's interleaved ::2/1::2) and writes per-dim rope_freqs.weight carrying HF's exact YaRN inv_freq (truncate=false), since ggml's shared YaRN ramp floor/ceils the correction band.
- 0004 model graph + loader wiring (llama_model_openai_privacy_filter): gpt-oss MoE body as a bidirectional token classifier with no KV cache, uniform symmetric sliding-window band, attention sinks, no LM head; ends at the per-token hidden states so the framework's TOKEN_CLS pooling applies the cls.output head per token. Uses the interleaved (GPT-J) LLAMA_ROPE_TYPE_NORM layout, unlike gpt-oss's NEOX, and feeds the per-layer rope_freqs into ggml_rope_ext with ggml's YaRN ramp disabled (mscale kept via rope_attn_factor).
- 0005 no-cache all-SWA mask fix (llama-graph.cpp): an encoder whose every layer is SWA leaves the full (non-windowed) attention mask unallocated; set_input now only fills a mask that actually got a buffer, else the model aborts on the first decode.

Status: parity solved. The new arch matches the HF reference token-for-token against OpenMed/privacy-filter-multilingual at F16: 12/12 argmax, full-logit cosine = 1.0, every layer's residual stream cos = 1.0 (relerr ~2e-4 = F16 rounding), including the e-mail BIOES span. Verified on the real llama-embedding binary with model-default TOKEN_CLS pooling. Root cause of the earlier attenuation was two independent RoPE bugs (NEOX vs interleaved/NORM dim-pairing, dominant; plus ggml's YaRN truncate rounding), both fixed in 0003/0004. The two parity-gated assumptions (n_swa = 2*sliding_window and the gate_up packing) are confirmed correct.

Plans/integration notes under docs/plans/pii-ner-ggml.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
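The move from HF codepoint offsets to UTF-8 byte offsets mentioned for backend/python/transformers can be sketched as follows. This is an illustrative conversion only, not the code in that backend; the helper name `char_span_to_byte_span` and the sample sentence are assumptions.

```python
from typing import Tuple


def char_span_to_byte_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Convert a [start, end) codepoint span into a [start, end) UTF-8 byte span.

    Hypothetical helper: tokenizer offsets index Python str codepoints, while
    a byte-offset contract indexes the UTF-8 encoding of the same text.
    """
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = byte_start + len(text[start:end].encode("utf-8"))
    return byte_start, byte_end


# Example: "ü" is one codepoint but two UTF-8 bytes, so the spans diverge.
text = "Herr Müller wohnt in Köln"
char_start = text.index("Müller")
char_end = char_start + len("Müller")
byte_start, byte_end = char_span_to_byte_span(text, char_start, char_end)

# Slicing the encoded text with the byte span recovers the exact entity text.
entity_text = text.encode("utf-8")[byte_start:byte_end].decode("utf-8")
```

Any span that follows a multibyte character shifts by the extra bytes, which is why codepoint offsets produce wrong spans on multilingual input when the consumer expects bytes.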

The devShell shipped `protobuf` (for Go proto generation) but no C++ gRPC, so `make grpc-server` in backend/cpp/llama-cpp could not locate gRPC via find_package(gRPC) and fell back to a stale, version-skewed grpc from the store (protobuf 34.1 headers vs a grpc built against 32.1), aborting on a protobuf gencode mismatch. nixpkgs builds `grpc` against the same `protobuf`, so adding it gives a self-consistent C++ stack. Docker (backend/Dockerfile.base-grpc-builder) compiles gRPC v1.65.0 / protoc v27.1 from source; the nixpkgs pair here (grpc 1.80 / protobuf 34) is newer but wire/ABI-consistent. Verified a clean cmake build of grpc-server inside `nix develop` with no manual flag overrides. Assisted-by: claude-code:claude-opus-4-8 [Claude Code]

Implements the TokenClassify gRPC primitive in the vendored llama.cpp backend, completing Phase 2 of the PII NER tier. Mirrors Score's direct-decode strategy (bypassing the slot/task queue under the same conflict_guard + mutex) because it needs full control over batch output flags, per-token logit readout, and overlapping-window stitching.

Pipeline: tokenize (o200k) with UTF-8 byte offsets -> windowed non-causal forward -> per-token n_cls_out logits via llama_get_embeddings_ith -> fp32 log_softmax -> constrained linear-chain BIOES Viterbi (the model's transition biases are 0.0, so structural constraints only) -> span assembly -> whitespace-trimmed byte spans -> TokenClassifyEntity{entity_group, start, end, score, text}.

Windowing uses a halo of n_layer*sliding_window: a symmetric +/-128 band per layer compounds across the 8 layers, so a token's logits depend on +/-1024 neighbours, not +/-128 (short inputs stay a single exact forward). Requires a TOKEN_CLS-pooling model loaded with embeddings enabled.

Validated end-to-end against OpenMed/privacy-filter-multilingual at F16: correct entities across English/German with byte-exact multibyte offsets (the 2-byte U+00FC in "Müller" is spanned correctly).

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
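The log_softmax plus constrained BIOES Viterbi stage of this pipeline can be sketched in Python. This is an illustration under stated assumptions, not the C++ implementation: label strings of the form `O`, `B-X`, `I-X`, `E-X`, `S-X`, the helper names, and the requirement that the label set contains `O` are all assumed here. Allowed transitions score 0.0 (matching the zero transition biases mentioned above) and disallowed ones are excluded.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def split(label: str) -> Tuple[str, str]:
    """'B-EMAIL' -> ('B', 'EMAIL'); 'O' -> ('O', '')."""
    if label == "O":
        return "O", ""
    prefix, _, entity = label.partition("-")
    return prefix, entity


def allowed(prev: str, cur: str) -> bool:
    """Structural BIOES constraint on a transition prev -> cur."""
    pp, pe = split(prev)
    cp, ce = split(cur)
    if pp in ("B", "I"):  # inside an open span: must continue or close it
        return cp in ("I", "E") and ce == pe
    return cp in ("O", "B", "S")  # after O/E/S: stay outside or open a span


def viterbi_bioes(logits: Sequence[Sequence[float]], labels: Sequence[str]) -> List[str]:
    """Best label path under the BIOES constraints (labels must include 'O')."""
    if not logits:
        return []
    emissions = [log_softmax(row) for row in logits]
    n = len(labels)
    # A sequence may only start outside a span or by opening one.
    score = [
        emissions[0][j] if split(labels[j])[0] in ("O", "B", "S") else NEG_INF
        for j in range(n)
    ]
    back: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score, ptr = [NEG_INF] * n, [0] * n
        for j in range(n):
            best_i, best = 0, NEG_INF
            for i in range(n):
                if score[i] > best and allowed(labels[i], labels[j]):
                    best_i, best = i, score[i]
            new_score[j] = best + emissions[t][j]
            ptr[j] = best_i
        score = new_score
        back.append(ptr)
    # A sequence may only end outside a span or by closing one.
    enders = [j for j in range(n) if split(labels[j])[0] in ("O", "E", "S")]
    last = max(enders, key=lambda j: score[j])
    path = [last]
    for ptr in reversed(back):
        last = ptr[last]
        path.append(last)
    path.reverse()
    return [labels[j] for j in path]
```

The point of the constraint is that a per-token argmax can yield an impossible sequence such as `B-X, O, E-X`, whereas the constrained decode always returns a well-formed set of spans.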

Adds FLAG_TOKEN_CLASSIFY, mirroring FLAG_SCORE's explicit-opt-in pattern for an internal direct-decode RPC: it's declared via `known_usecases: [token_classify]`, is authoritative (HasUsecases won't paint chat/embeddings on top via the heuristic), and has no guessing heuristic. On llama-cpp Validate() rejects combining it with chat/completion (TokenClassify bypasses the slot loop and races generation) but allows embeddings, which TOKEN_CLS pooling requires. Like FLAG_SCORE this is intentionally not registered in backend_capabilities.go's UsecaseInfoMap: it has no public REST route (the PII redactor's NER tier calls TokenClassify directly), so it stays a known_usecases flag only. Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
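The validation rule described here can be sketched as below. This is a simplified Python illustration of the stated behaviour, not the Go Validate() method; the function name, the error string and the exact usecase names other than `token_classify` are assumptions.

```python
from typing import Iterable, List

# Usecases that run through the generation slot loop (assumed names).
GENERATION_USECASES = {"chat", "completion"}


def validate_usecases(backend: str, known_usecases: Iterable[str]) -> List[str]:
    """Return a list of validation errors (empty when the config is acceptable).

    Hypothetical helper: on llama-cpp, token_classify may be combined with
    embeddings (TOKEN_CLS pooling needs them) but not with chat/completion.
    """
    usecases = set(known_usecases)
    errors: List[str] = []
    if backend == "llama-cpp" and "token_classify" in usecases:
        clash = sorted(usecases & GENERATION_USECASES)
        if clash:
            errors.append(
                "token_classify cannot be combined with " + ", ".join(clash)
            )
    return errors
```

Rejecting the combination at config time, rather than at request time, keeps the direct-decode path from ever racing a generation slot on the same model.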

Gallery entry for the OpenMed/privacy-filter-multilingual PII NER model, converted to GGUF (F16) for the vendored llama.cpp backend. Sets backend: llama-cpp, embeddings: true, and known_usecases: [token_classify] so it loads under TOKEN_CLS pooling and is consumed via a model config's pii.ner.model seam (not a standalone chat/completion model). The uri points at localai-org/privacy-filter-multilingual-GGUF; the sha256 is the F16 artifact's real hash. The model runs only on a llama.cpp build carrying the openai-privacy-filter carry-patches in backend/cpp/llama-cpp/patches/ (the arch is not yet upstream). Assisted-by: claude-code:claude-opus-4-8 [Claude Code]

Invert the PII filtering model so detection policy lives on the NER (token-classification) detector model itself, and consuming models just reference detectors by name.

Schema (core/config/model_config.go):

- New top-level `pii_detection:` block on a detector model (min_score, default_action, entity_actions) + accessors.
- Consumer `pii:` is now `{ enabled, detectors: [<model>...] }`.
- `pii.patterns` / `pii.ner` kept only as untyped deprecated shadows so old YAMLs still parse; Validate() warns (does not fail).

Middleware (core/services/routing/pii):

- Redactor is now a stateless handle; RedactNER(ctx, text, []NERConfig) runs every detector, unions hits, and overlap-merges (block>mask>allow).
- NERDetectorResolver returns (NERConfig, bool); the resolver reads each detector model's pii_detection policy (NERConfigFromRaw).
- RequestMiddleware is NER-only, multi-detector, fail-closed on a detector that can't be resolved or errors.

Regex tier fully removed: patterns.go, config.go (LoadConfig/--pii-config), the response-side StreamFilter, the /api/pii/{patterns,test,decide,persist} admin routes, the MCP list/test/set/persist pattern tools, and the dead --pii-config/--disable-pii AppOptions + runtime_settings overrides. Output/streaming redaction is dropped for now (NER is request-side only).

Cloud-proxy/MITM now runs NER on the request input (mitm/handler.go gets a per-host []NERConfig resolved at listener start), fail-closed; the response is forwarded unmodified.

Capability metadata: pii.detectors (model-multi-select filtered to token_classify) + pii_detection.* registry entries; config-metadata autocomplete gains a token_classify case; API instructions rewritten.

UI (React) is a follow-up; this is the Go side.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
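The "union hits, then overlap-merge (block>mask>allow)" step can be sketched in Python. This is not the Go RedactNER code: the `Hit` type, the half-open span convention and the choice to widen a merged span to cover both hits are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Stricter actions win when spans overlap: block > mask > allow.
PRECEDENCE = {"allow": 0, "mask": 1, "block": 2}


@dataclass(frozen=True)
class Hit:
    start: int  # inclusive offset (assumed half-open [start, end) span)
    end: int    # exclusive offset
    action: str


def merge_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Union hits from all detectors and merge overlapping spans.

    Overlapping hits collapse into one span covering both, carrying the
    strictest action of the two. Touching spans (end == start) stay separate.
    """
    merged: List[Hit] = []
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if merged and hit.start < merged[-1].end:
            prev = merged[-1]
            action = max(prev.action, hit.action, key=PRECEDENCE.__getitem__)
            merged[-1] = Hit(prev.start, max(prev.end, hit.end), action)
        else:
            merged.append(hit)
    return merged
```

With this ordering, one detector voting `block` on a span cannot be weakened by another detector voting `mask` or `allow` on an overlapping span.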

- gallery/index.yaml: privacy-filter-multilingual ships a default pii_detection policy (validated against the GGUF's real label set: mask everything; block PASSWORD/PIN/CVV/CREDITCARD/IBAN/BIC/BANKACCOUNT/SSN/{BITCOIN,ETHEREUM,LITECOIN}ADDRESS).
- docs/advanced/model-configuration.md: new "PII filtering" section (pii_detection on detector models + pii.detectors on consumers).
- docs/features/middleware.md: rewrote the PII section for the NER-only model; dropped the removed regex pattern catalogue / endpoints / MCP pattern tools / streaming filter.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
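How a pii_detection policy of this shape might map a detected entity to an action can be sketched in Python. The field names (min_score, default_action, entity_actions) and the blocked labels come from the commit messages; the `min_score` value of 0.5, the `FIRSTNAME` label, the helper name and the rule that hits below `min_score` are ignored are assumptions.

```python
from typing import Any, Dict, Optional

# Illustrative policy mirroring the described gallery default:
# mask everything, block a handful of high-risk labels.
POLICY: Dict[str, Any] = {
    "min_score": 0.5,  # assumed value, not taken from the gallery entry
    "default_action": "mask",
    "entity_actions": {
        label: "block"
        for label in (
            "PASSWORD", "PIN", "CVV", "CREDITCARD", "IBAN", "BIC",
            "BANKACCOUNT", "SSN", "BITCOINADDRESS", "ETHEREUMADDRESS",
            "LITECOINADDRESS",
        )
    },
}


def resolve_action(policy: Dict[str, Any], entity_group: str, score: float) -> Optional[str]:
    """Return 'mask', 'block' or 'allow' for a hit, or None if it is ignored.

    Hypothetical helper: a per-entity action overrides the default, and hits
    scoring below min_score are dropped (an assumption about the semantics).
    """
    if score < policy.get("min_score", 0.0):
        return None
    actions = policy.get("entity_actions", {})
    return actions.get(entity_group, policy.get("default_action", "mask"))
```

Keeping `mask` as the fallback means a label the policy author never listed is still redacted, which matches the safe-by-default stance described for the filter.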

React UI side of the PII redesign.

- New EntityActionListEditor (entity group → mask|block|allow map editor for a detector model's pii_detection.entity_actions, with a datalist of common categories) and ModelMultiSelect (capability-filtered detector picker for a consumer's pii.detectors).
- ConfigFieldRenderer: dispatch `entity-action-list` + `model-multi-select`; map `models:token_classify` → FLAG_TOKEN_CLASSIFY; drop `pii-pattern-list`.
- capabilities.js: CAP_TOKEN_CLASSIFY.
- Middleware page: remove the pattern catalogue, the per-pattern action editor, and the "Save to disk" persist flow; the per-model table now shows the NER detectors each config references. Removed the dead pattern-mutation state/handlers.
- modelTemplates: MITM template seeds pii.detectors instead of pii.patterns.
- Deleted PIIPatternListEditor.
- e2e/middleware-page.spec: fixture + tests updated for the detector model; removed the PUT /api/pii/patterns test.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]

richiejp added 10 commits June 2, 2026 11:16

richiejp wants to merge 10 commits into mudler:master from richiejp:feat/pii-ner-tier
