- Frames 4.5.0 as a focused, lightweight text PII screening release rather than a v5 package overhaul.
- Keeps the first path centered on core install, regex scanning/redaction, CLI text commands, and agent-oriented guardrail helpers.
- Defers dedicated Sentry, OpenTelemetry, logging-framework, and cloud DLP middleware adapters to v5 planning.
- Clarifies the live top-level APIs: `scan`, `redact`, `protect`, `scan_prompt`, `filter_output`, `sanitize`, and guardrail helpers.
- Documents the current module map so users and contributors can distinguish live 4.5 modules from historical compatibility and audit artifacts.
- Preserves backward-compatible `DataFog` and `TextService` entry points.
- Adds regex-only German structured PII support without adding core dependencies.
- Detects German VAT IDs and German IBANs by default because their country-code structure is precise enough for default screening.
- Enables broader German identifiers only through `locales=["de"]` or explicit entity selection, including German tax IDs, pension insurance numbers, postal codes, passport numbers, and residence permit numbers.
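
As a rough illustration of why country-prefixed identifiers are safe to screen by default, the sketch below matches German IBANs (`DE`, two check digits, 18 further digits) and German VAT IDs (`DE` plus nine digits). The pattern names and the `find_german_ids` helper are hypothetical and written for this note; DataFog's actual regexes may differ.

```python
import re

# Hypothetical patterns for illustration only; not DataFog's real regexes.
# German IBAN: "DE" + 2 check digits + 18 digits, optionally grouped in fours.
DE_IBAN = re.compile(r"\bDE\d{2}(?: ?\d{4}){4} ?\d{2}\b")
# German VAT ID (USt-IdNr.): "DE" + exactly 9 digits.
DE_VAT = re.compile(r"\bDE\d{9}\b")


def find_german_ids(text):
    """Return (label, match) pairs for German IBANs and VAT IDs in text."""
    hits = []
    for label, pattern in (("DE_IBAN", DE_IBAN), ("DE_VAT_ID", DE_VAT)):
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits
```

The fixed `DE` prefix and exact digit counts keep false positives low, which is not true of looser identifiers such as five-digit postal codes; that difference is the reason the broader set is opt-in.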
- Certifies Python 3.13 support for the core SDK, CLI, `nlp`, `nlp-advanced`, and `ocr` install profiles.
- Adds CI coverage for Python 3.13 `nlp` and `nlp-advanced` test profiles plus 3.13 smoke checks for `nlp`, `nlp-advanced`, and `ocr`.
- Documents Donut OCR as requiring a local model before runtime use.
- Leaves `distributed` and `all` outside the new Python 3.13 certification claim for 4.5.0.
- Documents OCR and Spark as supported optional surfaces, not deprecated features and not the main 4.5 adoption path.
- Keeps local OCR behind `datafog[ocr]`, URL image inputs behind `datafog[web,ocr]`, Donut behind `datafog[nlp-advanced,ocr]`, and Spark behind `datafog[distributed]`.
- Documents telemetry behavior without changing defaults.
- Telemetry remains disabled unless `DATAFOG_TELEMETRY=1` is set. `DATAFOG_NO_TELEMETRY=1` and `DO_NOT_TRACK=1` continue to force telemetry off for tests, CI, and privacy-sensitive environments.
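
The precedence described above (opt-in by default, with the two opt-out variables always winning) can be sketched as follows. The `telemetry_enabled` function is a hypothetical helper written for this note, not DataFog's internal implementation; only the three environment variable names come from the release notes.

```python
import os


def telemetry_enabled(env=None):
    """Resolve telemetry state from environment variables (illustrative)."""
    env = os.environ if env is None else env
    # Opt-out switches force telemetry off regardless of the opt-in flag.
    if env.get("DATAFOG_NO_TELEMETRY") == "1" or env.get("DO_NOT_TRACK") == "1":
        return False
    # Telemetry stays disabled unless explicitly enabled.
    return env.get("DATAFOG_TELEMETRY") == "1"
```

Passing a plain dict instead of reading `os.environ` makes the rule easy to exercise in tests without mutating the process environment.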
- Adds a 4.5 release-readiness checklist covering docs build, formatting, core no-network checks, install-profile smoke checks, German regex tests, broad non-slow tests, package build checks, and final CI status.
- Clarifies the version alignment path: the development package remains `4.4.0a5` until stable release promotion, and the final stable release should publish as `4.5.0`.
- Added a new internal engine boundary in `datafog/engine.py`:
  - `scan()`
  - `redact()`
  - `scan_and_redact()`
  - dataclasses: `Entity`, `ScanResult`, `RedactResult`
- Updated core compatibility layers (`datafog.core`, `datafog.main`, CLI paths) to delegate through the engine interface.
- Added `EngineNotAvailable` error for clear optional dependency failures.
- Improved smart engine behavior for graceful fallback when optional NLP dependencies are unavailable.
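
One way to picture the optional-dependency behavior described above is the sketch below. Everything in it is a local stand-in: the `EngineNotAvailable` class is redefined here rather than imported from `datafog`, and `load_optional`, `regex_scan`, and `smart_scan` are hypothetical helpers, so this shows the general pattern rather than the library's code.

```python
import importlib
import re


class EngineNotAvailable(ImportError):
    """Local stand-in for the error named above; not imported from datafog."""


# Simplified email pattern, used only to give the regex tier something to find.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def regex_scan(text):
    return [("EMAIL", m.group(0)) for m in EMAIL.finditer(text)]


def load_optional(module_name, extra):
    """Import an optional module or raise a clear, actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise EngineNotAvailable(
            f"{module_name} is not installed; try: pip install datafog[{extra}]"
        ) from exc


def smart_scan(text, ner_module="a_module_that_is_not_installed"):
    """Run the regex tier, then degrade gracefully if the NER tier is missing."""
    hits = regex_scan(text)
    try:
        load_optional(ner_module, "nlp-advanced")
    except EngineNotAvailable:
        return hits  # fallback: regex-only results
    return hits  # a real engine would merge NER results here
```

Raising a dedicated error at the import boundary lets callers that require NER fail loudly, while a cascading caller can catch it and continue with regex results.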
- Added a corpus-driven detection accuracy suite:
  - `tests/corpus/structured_pii.json`
  - `tests/corpus/unstructured_pii.json`
  - `tests/corpus/mixed_pii.json`
  - `tests/corpus/negative_cases.json`
  - `tests/corpus/edge_cases.json`
  - `tests/test_detection_accuracy.py`
- Improved regex patterns for email, date/year handling, SSN boundaries, and strict IPv4 matching.
- Added explicit `xfail` markers for known model limitations in select smart/NER corpus cases.
- Added engine API tests in `tests/test_engine_api.py`.
- Added agent API tests in `tests/test_agent_api.py`.
- Updated Spark integration tests to skip cleanly when Java is not available.
- Added `datafog/agent.py` with:
  - `sanitize()`
  - `scan_prompt()`
  - `filter_output()`
  - `create_guardrail()`
  - `Guardrail` and `GuardrailWatch`
- Exported the agent-oriented API from the top-level `datafog` package.
- Updated GitHub Actions CI matrix to test Python `3.10`, `3.11`, and `3.12` across `core`, `nlp`, and `nlp-advanced` profiles.
- Added coverage enforcement thresholds in CI (line and branch).
- Added a dedicated corpus accuracy run in CI.
- Rewrote `README.md` with validated, copy-pasteable examples and a dedicated LLM guardrails section.
- Added/updated audit reports under `docs/audit/`.
- GLiNER Integration: Added modern Named Entity Recognition engine with GLiNER (Generalist Model for NER)
  - New `gliner` engine option in TextService providing 32x performance improvement over spaCy
  - PII-specialized model support (`urchade/gliner_multi_pii-v1`) for enhanced accuracy
  - Custom entity type configuration for domain-specific detection
  - Automatic model downloading and caching functionality
- Smart Cascading Engine: Introduced intelligent multi-engine approach
  - New `smart` engine that progressively tries regex → GLiNER → spaCy
  - Configurable stopping criteria based on entity count thresholds
  - Optimized for best accuracy/performance balance (60x average speedup)
- Enhanced CLI Model Management: Extended command-line interface
  - `--engine` flag support for `download-model` and `list-models` commands
  - GLiNER model discovery and management capabilities
  - Unified model management across spaCy and GLiNER engines
- Optional Dependencies: Added new `nlp-advanced` extra for GLiNER dependencies
  - `pip install datafog[nlp-advanced]` for GLiNER + PyTorch + Transformers
  - Maintained lightweight core architecture (<2MB)
  - Graceful degradation when GLiNER dependencies are unavailable
- Engine Ecosystem: Expanded from 3 to 5 annotation engines
  - `regex`: 190x faster, structured PII detection (core only)
  - `gliner`: 32x faster, modern NER with custom entities
  - `spacy`: Traditional NLP, comprehensive entity recognition
  - `smart`: Cascading approach for optimal accuracy/speed
  - `auto`: Legacy regex→spaCy fallback
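
The cascading idea behind the `smart` engine (try the cheapest engine first and stop once enough entities are found) can be sketched generically. The `cascade` function, its `min_entities` parameter, and the stub engines below are hypothetical and exist only for this illustration; the real engine's stopping criteria and result merging may differ.

```python
def cascade(text, engines, min_entities=1):
    """Run engines in order and stop at the first with enough entities.

    `engines` is a list of (name, callable) pairs ordered cheapest first.
    Returns (engine_name, entities) for the engine that satisfied the
    threshold, or the last engine's results if none did.
    """
    name, results = None, []
    for name, engine in engines:
        results = engine(text)
        if len(results) >= min_entities:
            break
    return name, results


# Stub engines standing in for regex, GLiNER, and spaCy (assumed behavior).
def regex_stub(text):
    return ["EMAIL"] if "@" in text else []


def gliner_stub(text):
    return ["PERSON"] if "Alice" in text else []


def spacy_stub(text):
    return []


ENGINES = [("regex", regex_stub), ("gliner", gliner_stub), ("spacy", spacy_stub)]
```

With these stubs, `cascade("mail a@b.com", ENGINES)` stops at the regex tier, while text containing only a name falls through to the second tier, so the heavier models run only when the cheap pass finds too little.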
- Validated Performance: Comprehensive benchmarking across all engines
  - GLiNER: 32x faster than spaCy with superior NER accuracy
  - Smart cascading: 60x average speedup with highest accuracy scores
  - Regex: Maintained 190x performance advantage
- Comprehensive Testing: Added 19 new test cases for GLiNER integration
  - Full coverage of GLiNER annotator functionality
  - Graceful degradation testing for missing dependencies
  - Smart cascading logic validation
  - Cross-engine integration testing
- Updated Documentation: Comprehensive guides and examples
  - README performance comparison table with all 5 engines
  - Engine selection guidance with use case recommendations
  - GLiNER model management and CLI usage examples
  - Installation options for different dependency combinations
- Developer Guide: Streamlined development documentation
  - Updated architecture overview with GLiNER integration
  - Performance requirements and testing strategies
  - Common development patterns and best practices
- Engine Options: New engine types added to TextService
  - Existing code using `engine="auto"` continues to work unchanged
  - New engines `gliner` and `smart` require the `[nlp-advanced]` extra
- New Optional Dependencies (`nlp-advanced` extra):
  - `gliner>=0.2.5`
  - `torch>=2.1.0,<2.7`
  - `transformers>=4.20.0`
  - `huggingface-hub>=0.16.0`
For users upgrading from v4.1.1:
- All existing functionality remains unchanged
- To use GLiNER: `pip install datafog[nlp-advanced]`
- Smart cascading: `TextService(engine="smart")` for best balance
- CLI: Use the `--engine gliner` flag for GLiNER model management
- Added engine selection functionality to TextService class, allowing users to choose between 'regex', 'spacy', or 'auto' annotation engines
- Enhanced TextService with intelligent fallback mechanism in 'auto' mode that tries regex first and falls back to spaCy if no entities are found
- Added comprehensive integration tests for the new engine selection feature
- Implemented performance benchmarks showing regex engine is ~123x faster than spaCy
- Added CI pipeline for continuous performance monitoring with regression detection
- Added wheel-size gate (< 8 MB) to CI pipeline
- Added 'When do I need spaCy?' guidance to documentation
- Created scripts for running benchmarks locally and comparing results
- Improved documentation with performance metrics and engine selection guidance
- Extended .gitignore to better handle build artifacts and development files
- Added GitHub Actions workflows for testing, linting, and benchmarking
- Pinned all dependency versions in requirements.txt and requirements-dev.txt for reproducible builds
- Added mypy type checking to CI pipeline
- Added ruff linting to development dependencies
- Finalized stable release, no breaking changes from 4.1.0b5
- Added datafog-python/examples/uploading-file-types.ipynb to show JSON uploading example (#16)
- Added datafog-python/tests/regex_issue.py to show issue with regex recognizer creation
- Moved versioning to separate invocable function in setup.py