Commit a79cd8f

chore: land v4.5 release prep baseline
Refs DFPY-71
1 parent c1093d6 commit a79cd8f

11 files changed

Lines changed: 231 additions & 115 deletions

.bumpversion.cfg

Lines changed: 1 addition & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,5 @@
11
[bumpversion]
2-
current_version = 4.3.0
2+
current_version = 4.4.0a5
33
commit = True
44
tag = True
55
tag_name = v{new_version}
@@ -20,7 +20,3 @@ values =
2020
[bumpversion:file:datafog/__about__.py]
2121
search = __version__ = "{current_version}"
2222
replace = __version__ = "{new_version}"
23-
24-
[bumpversion:file:setup.py]
25-
search = version="{current_version}"
26-
replace = version="{new_version}"
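Review note: the remaining `[bumpversion:file:datafog/__about__.py]` section relies on the `search`/`replace` templates being rendered with the current and new version strings. The sketch below approximates that substitution in plain Python. It is not bump2version's code; `apply_bump` and the target version `4.4.0a6` are hypothetical names and values chosen only for illustration.

```python
def apply_bump(text: str, search: str, replace: str,
               current_version: str, new_version: str) -> str:
    """Render the search/replace templates and substitute, failing if absent."""
    needle = search.format(current_version=current_version)
    if needle not in text:
        # A configured file that lacks the rendered search string is an error.
        raise ValueError(f"pattern not found: {needle!r}")
    return text.replace(needle, replace.format(new_version=new_version))


about = '__version__ = "4.4.0a5"\n'  # stand-in for datafog/__about__.py
bumped = apply_bump(
    about,
    search='__version__ = "{current_version}"',
    replace='__version__ = "{new_version}"',
    current_version="4.4.0a5",
    new_version="4.4.0a6",  # hypothetical next version, illustration only
)
print(bumped, end="")
```

A configured file section whose search string no longer appears in the target file would fail in this sketch, which is one plausible reason to drop a stale section such as the `setup.py` one.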

.gitignore

Lines changed: 3 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -24,6 +24,7 @@ error_log.txt
2424
# Environment
2525
.env
2626
.venv
27+
.venv*/
2728
venv/
2829
env/
2930
examples/venv/
@@ -58,14 +59,14 @@ docs/*
5859
!docs/conf.py
5960
!docs/Makefile
6061
!docs/make.bat
62+
!docs/agents/
63+
!docs/agents/**
6164
!docs/audit/
6265
!docs/audit/**
6366

6467
# Keep all directories but ignore their contents
6568
*/**/__pycache__/
6669

67-
# Keep all files but ignore their contents
68-
Claude.md
6970
notes/benchmarking_notes.md
7071
Roadmap.md
7172
notes/*
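Review note: the new `.venv*/` rule widens the existing `.venv` entry to any directory whose name starts with `.venv`. A rough way to see which names the wildcard covers is Python's `fnmatch`, shown below. This is only an approximation: gitignore matching has its own rules (the trailing `/` limits the rule to directories, and `!` negation and `**` are not modelled here), and the candidate names are hypothetical.

```python
from fnmatch import fnmatchcase

# Directory-name part of the new `.venv*/` rule (trailing slash dropped,
# since fnmatch knows nothing about directories).
pattern = ".venv*"

# Hypothetical directory names, for illustration only.
candidates = [".venv", ".venv-py311", ".venv311", "venv", "env"]

matched = [name for name in candidates if fnmatchcase(name, pattern)]
print(matched)
```

Names without the leading dot, such as `venv` and `env`, are not covered by the wildcard and still depend on their own entries in the file.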

Claude.md renamed to AGENTS.md

Lines changed: 70 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -1,18 +1,26 @@
1-
# DataFog - Claude Development Guide
1+
# DataFog - Agent Development Guide
22

33
## Project Overview
4+
45
**DataFog** is an open-source Python library for PII detection and anonymization with a focus on speed and lightweight architecture.
56

67
## Core Value Proposition
8+
79
- **Ultra-Fast Performance**: 190x faster than spaCy for structured PII, 32x faster with GLiNER
810
- **Lightweight Core**: <2MB package with optional ML extras
911
- **Modern Engine Options**: Regex, GLiNER, spaCy, and smart cascading
1012
- **Production Ready**: Comprehensive testing, CI/CD, and performance validation
1113

1214
## Current Project Status
13-
**Version: 4.3.0**
15+
16+
**Stable version: 4.4.0**
17+
18+
**Development version: 4.4.0a5**
19+
20+
**Next minor target: 4.5.0**
1421

1522
### ✅ Recently Completed (Latest)
23+
1624
- **GLiNER Integration**: Modern NER engine with PII-specialized models
1725
- **Smart Cascading**: Intelligent regex → GLiNER → spaCy progression
1826
- **Enhanced CLI**: Model management with `--engine` flags
@@ -43,6 +51,7 @@ python -c "from datafog.services.text_service import TextService; print('✅ All
4351
## Architecture Overview
4452

4553
### Engine Ecosystem (Updated with GLiNER)
54+
4655
```python
4756
from datafog.services.text_service import TextService
4857

@@ -59,37 +68,42 @@ auto_service = TextService(engine="auto") # Legacy: regex→spaCy
5968
```
6069

6170
### Performance Comparison (Validated)
62-
| Engine | Speed vs spaCy | Accuracy | Use Case | Install |
63-
|---------|----------------|----------|----------|---------|
64-
| `regex` | **190x faster** | High (structured) | Emails, phones, SSNs | Core only |
65-
| `gliner` | **32x faster** | Very High | Modern NER, custom entities | `[nlp-advanced]` |
66-
| `spacy` | 1x (baseline) | Good | Traditional NLP | `[nlp]` |
67-
| `smart` | **60x faster** | Highest | Best balance | `[nlp-advanced]` |
71+
72+
| Engine | Speed vs spaCy | Accuracy | Use Case | Install |
73+
| -------- | --------------- | ----------------- | --------------------------- | ---------------- |
74+
| `regex` | **190x faster** | High (structured) | Emails, phones, SSNs | Core only |
75+
| `gliner` | **32x faster** | Very High | Modern NER, custom entities | `[nlp-advanced]` |
76+
| `spacy` | 1x (baseline) | Good | Traditional NLP | `[nlp]` |
77+
| `smart` | **60x faster** | Highest | Best balance | `[nlp-advanced]` |
6878

6979
### Dependency Strategy
80+
7081
```python
7182
# Lightweight core (<2MB)
7283
pip install datafog
7384

7485
# Optional ML engines
7586
pip install datafog[nlp] # spaCy (traditional NLP)
76-
pip install datafog[nlp-advanced] # GLiNER (modern NER)
87+
pip install datafog[nlp-advanced] # GLiNER (modern NER)
7788
pip install datafog[ocr] # Image processing
7889
pip install datafog[all] # Everything
7990
```
8091

8192
## GLiNER Integration (NEW)
8293

8394
### Overview
95+
8496
GLiNER (Generalist Model for Named Entity Recognition) provides modern, accurate NER capabilities optimized for PII detection.
8597

8698
### Key Features
99+
87100
- **PII-Specialized Models**: `urchade/gliner_multi_pii-v1` trained specifically for PII
88101
- **Custom Entity Types**: Configurable entity detection beyond default PII types
89102
- **Smart Cascading**: Automatically tries regex first, GLiNER second, spaCy last
90103
- **CLI Management**: Download and manage GLiNER models via CLI
91104

92105
### Usage Examples
106+
93107
```python
94108
# GLiNER engine
95109
from datafog.services.text_service import TextService
@@ -108,6 +122,7 @@ subprocess.run(["datafog", "list-models", "--engine", "gliner"])
108122
```
109123

110124
### Available GLiNER Models
125+
111126
- `urchade/gliner_multi_pii-v1` - PII-specialized (recommended)
112127
- `urchade/gliner_base` - General purpose starter
113128
- `urchade/gliner_large-v2` - Higher accuracy
@@ -116,17 +131,19 @@ subprocess.run(["datafog", "list-models", "--engine", "gliner"])
116131
## Development Workflow
117132

118133
### Git Branch Strategy
134+
119135
- **main**: Production releases only
120136
- **dev**: Main development branch (use this)
121-
- **feature/***: New features from dev
122-
- **fix/***: Bug fixes from dev
137+
- **feature/\***: New features from dev
138+
- **fix/\***: Bug fixes from dev
123139

124140
### Making Changes
141+
125142
```bash
126143
# Start from dev
127144
git checkout dev && git pull origin dev
128145

129-
# Create feature branch
146+
# Create feature branch
130147
git checkout -b feature/your-change
131148

132149
# Make changes, test, commit
@@ -137,6 +154,7 @@ git push -u origin feature/your-change
137154
```
138155

139156
### Testing
157+
140158
```bash
141159
# Run specific test suites
142160
pytest tests/test_text_service.py -v # Core functionality
@@ -149,13 +167,14 @@ PYTEST_DONUT=yes pytest tests/test_ocr_integration.py # OCR with real models
149167

150168
# Performance requirements
151169
# - Regex: 150x+ faster than spaCy
152-
# - GLiNER: 25x+ faster than spaCy
170+
# - GLiNER: 25x+ faster than spaCy
153171
# - Package size: Core <2MB, full <8MB
154172
```
155173

156174
## Key Implementation Patterns
157175

158176
### Simple API (Recommended)
177+
159178
```python
160179
# Always available, lightweight
161180
from datafog import detect, process
@@ -164,6 +183,7 @@ result = process("john@example.com", method="redact")
164183
```
165184

166185
### Advanced Engine Selection
186+
167187
```python
168188
# For specialized use cases
169189
from datafog.services.text_service import TextService
@@ -173,7 +193,7 @@ service = TextService(engine="regex")
173193

174194
# Modern NER with custom entities
175195
service = TextService(
176-
engine="gliner",
196+
engine="gliner",
177197
gliner_model="urchade/gliner_base"
178198
)
179199

@@ -182,6 +202,7 @@ service = TextService(engine="smart")
182202
```
183203

184204
### Graceful Degradation
205+
185206
```python
186207
# Handles missing dependencies elegantly
187208
try:
@@ -194,18 +215,21 @@ except ImportError:
194215
## Common Tasks
195216

196217
### Adding New Entity Types
218+
197219
1. Update regex patterns in `regex_annotator.py`
198220
2. Add GLiNER entity types in `gliner_annotator.py`
199221
3. Update tests and benchmarks
200222
4. Validate performance doesn't regress >10%
201223

202224
### Performance Optimization
225+
203226
1. Profile with existing benchmarks
204227
2. Maintain speed thresholds (regex 150x+, GLiNER 25x+)
205228
3. Update baselines when making improvements
206229
4. Test across all engines
207230

208231
### CLI Enhancements
232+
209233
1. Update `client.py` with new commands
210234
2. Support `--engine` flag for multi-engine commands
211235
3. Add comprehensive help text and examples
@@ -215,31 +239,36 @@ except ImportError:
215239

216240
### Workflow Architecture (3 workflows)
217241

218-
| Workflow | Purpose | Trigger |
219-
|----------|---------|---------|
220-
| `ci.yml` | Lint + Test + Coverage + Wheel size | Push/PR to main/dev |
221-
| `release.yml` | Alpha/Beta/Stable publishing | Schedule + manual dispatch |
222-
| `benchmark.yml` | Performance benchmarks | Push/PR/weekly |
242+
| Workflow | Purpose | Trigger |
243+
| --------------- | ----------------------------------- | -------------------------- |
244+
| `ci.yml` | Lint + Test + Coverage + Wheel size | Push/PR to main/dev |
245+
| `release.yml` | Alpha/Beta/Stable publishing | Schedule + manual dispatch |
246+
| `benchmark.yml` | Performance benchmarks | Push/PR/weekly |
223247

224248
### Release Cadence
249+
225250
- **Alpha** (Mon-Wed 2AM UTC): Automatic from `dev`, date+commit versioning
226251
- **Beta** (Thursday 2AM UTC): Automatic from `dev`, incremental beta numbers
227252
- **Stable** (manual dispatch): From `main`, base version or override
228253

229254
### Release Pipeline
255+
230256
`determine-release` → `test` → `publish` → `cleanup`
257+
231258
- Tests are a hard gate — no tests = no publish
232259
- Stable releases check out `main`; alpha/beta check out `dev`
233260
- Old alphas pruned to 7, betas to 5
234261
- `[skip ci]` in version bump commits to prevent loops
235262

236263
### Pre-commit Hooks
264+
237265
- **isort**, **black**, **flake8**, **ruff**: Code formatting and linting
238266
- **prettier**: Markdown, JSON, YAML formatting
239267
- **gitleaks**: Secret scanning
240268
- **pre-commit-hooks**: Large file checks, merge conflict detection, YAML validation
241269

242270
## Environment Variables
271+
243272
```bash
244273
# Testing configuration
245274
export PYTEST_DONUT=yes # Enable real OCR testing
@@ -250,33 +279,51 @@ export PYTHONPATH=$(pwd) # Local development imports
250279
```
251280

252281
## Performance Requirements
282+
253283
- **Core Package**: <2MB (from ~8MB in v4.0.x)
254284
- **Regex Engine**: 150x+ faster than spaCy (currently 190x)
255-
- **GLiNER Engine**: 25x+ faster than spaCy (currently 32x)
285+
- **GLiNER Engine**: 25x+ faster than spaCy (currently 32x)
256286
- **Memory Usage**: Graceful handling of large texts (1MB+ chunks)
257287
- **Model Loading**: Cache GLiNER models to avoid repeated downloads
258288

259-
## Best Practices for Claude Agents
289+
## Agent skills
290+
291+
### Issue tracker
292+
293+
Issues and PRDs are tracked in Linear under the DFPY team. See `docs/agents/issue-tracker.md`.
294+
295+
### Triage labels
296+
297+
Use the default five-label triage vocabulary. See `docs/agents/triage-labels.md`.
298+
299+
### Domain docs
300+
301+
Single-context repo: use root `CONTEXT.md` and root `docs/adr/` when present. See `docs/agents/domain.md`.
302+
303+
## Best Practices for Agents
260304

261305
Before beginning any task please checkout a branch from `dev` and create a pull request to `dev`.
262306

263307
### Code Quality
308+
264309
- Follow existing patterns before implementing new approaches
265310
- Add comprehensive tests for all new functionality
266311
- Update documentation immediately with code changes
267312
- Run benchmarks for any text processing modifications
268313

269314
### GLiNER Development
315+
270316
- Use PII-specialized models when available (`urchade/gliner_multi_pii-v1`)
271317
- Test graceful degradation when GLiNER dependencies missing
272318
- Validate smart cascading thresholds with real data
273319
- Consider model download time and caching strategies
274320

275321
### Release Preparation
322+
276323
- Alpha/beta releases are automated via `release.yml` schedule
277324
- Stable releases: merge `dev` → `main`, then trigger `release.yml` with `stable` type
278325
- Use `dry_run: true` to validate before actual publish
279326
- Performance validation on realistic data sets
280-
- In Release Notes or Comments, do not reference that it was authored by Claude (all code is anonymously authored)
327+
- In Release Notes or Comments, do not reference that it was authored by an AI agent (all code is anonymously authored)
281328

282-
This guide provides the essential information for DataFog development while maintaining focus on current priorities and recent GLiNER integration work.
329+
This guide provides the essential information for DataFog development while maintaining focus on current priorities and recent GLiNER integration work.
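
Review note: the guide states minimum speedups relative to spaCy (regex 150x+, GLiNER 25x+). The sketch below shows one way such thresholds could be checked from measured timings. It is not DataFog's benchmark code; `speedup_failures`, `MIN_SPEEDUP`, and the timing values are hypothetical and made up for illustration.

```python
# Minimum speedups relative to spaCy, taken from the guide's stated requirements.
MIN_SPEEDUP = {"regex": 150.0, "gliner": 25.0}


def speedup_failures(seconds_by_engine: dict[str, float]) -> list[str]:
    """Return one message per engine that falls below its required speedup."""
    baseline = seconds_by_engine["spacy"]  # spaCy is the 1x baseline
    failures = []
    for engine, minimum in MIN_SPEEDUP.items():
        speedup = baseline / seconds_by_engine[engine]
        if speedup < minimum:
            failures.append(f"{engine}: {speedup:.1f}x < {minimum:.0f}x")
    return failures


# Made-up timings in seconds, chosen so that one engine misses its threshold.
timings = {"spacy": 1.9, "regex": 0.01, "gliner": 0.1}
print(speedup_failures(timings))
```

An empty list means every engine met its threshold; a CI step could fail the build whenever the list is non-empty.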
