Merged
40 commits
95f71ec
fix: anchor-aware DFA construction
jbachorik May 26, 2026
3fdeee5
fix: bounded quantifiers find() — upper-bound cap + SWAR multi-range …
jbachorik May 26, 2026
791423c
test: algorithmic fuzz test cross-checking Reggie vs JDK Pattern
jbachorik May 26, 2026
c13e30c
test: shrinker for fuzz findings + triage doc
jbachorik May 26, 2026
3615479
fix: counted quantifier with min=0 missing zero-reps ε-bypass
jbachorik May 26, 2026
22f1890
fix: find() loop no longer short-circuits on patterns where only one …
jbachorik May 26, 2026
218d487
fix: lazy quantifiers respect find-vs-matches mode + zero-width count…
jbachorik May 27, 2026
1b70fcd
fix: fuzz findings — SWAR multi-literal bug, cross-alt backref fallba…
jbachorik May 28, 2026
e01a5ab
fix: anchor-condition dilution + matches(\Z) + nullable backref + laz…
jbachorik May 28, 2026
f7741d4
fix: fuzz findings — SPECIALIZED_QUANTIFIED_GROUP, GreedyCharClass, O…
jbachorik May 28, 2026
fc8110d
docs: fuzz triage, issue priority list, libretti, and anchor diag test
jbachorik May 28, 2026
1b26625
fix: narrow optional-quantifier DFA conflict check to avoid false fal…
jbachorik May 28, 2026
800fbba
feat: support atomic groups and quoted literals
jbachorik May 29, 2026
51101c6
feat: add matchInto capture boundary API
jbachorik May 29, 2026
c662831
feat: fall back for oversized DFA state spaces
jbachorik May 29, 2026
6de21bd
bench: add logs backend grok benchmark
jbachorik May 29, 2026
1765624
feat: add allocation-free NFA matchInto
jbachorik May 29, 2026
91ea690
feat: add recursive descent matchInto
jbachorik May 29, 2026
259ccf3
fix: avoid recursive descent for delimited negated captures
jbachorik May 29, 2026
2e6d423
feat: add table-driven DFA backend
jbachorik May 29, 2026
f1d7584
perf: skip impossible DFA table scan starts
jbachorik May 29, 2026
1265585
feat: complete P2 parser compatibility
jbachorik May 29, 2026
a0bcf08
fix: support combined lookaround assertions
jbachorik May 29, 2026
f9b1923
chore: include method details for oversized bytecode fallback
jbachorik May 29, 2026
32a7d25
feat: add capture projection options
jbachorik May 29, 2026
4151297
feat: specialize access log grok matching
jbachorik May 29, 2026
4351e5c
feat: add structural pattern categorizer
jbachorik May 29, 2026
5ea1e7d
feat: classify reusable log pattern atoms
jbachorik May 29, 2026
90da424
feat: add linear template planning
jbachorik May 29, 2026
092fd5a
feat: add linear template runtime matcher
jbachorik May 29, 2026
fd90e24
feat: route named linear templates
jbachorik May 29, 2026
31eec11
feat: handle access log templates generically
jbachorik May 29, 2026
254c604
refactor: remove access log oracle routing
jbachorik May 29, 2026
a470d08
perf: reuse linear template optional scratch
jbachorik May 29, 2026
6b9db9d
refactor: remove access log matcher oracle
jbachorik May 29, 2026
af8f55f
test: assert real grok patterns use linear templates
jbachorik May 29, 2026
6082430
refactor: rename linear templates to token sequences
jbachorik May 29, 2026
a4cb218
test: harden log token sequence equivalence
jbachorik May 29, 2026
b98aef6
fix: harden token sequence capture semantics
jbachorik May 29, 2026
a0873c1
test: assert anchor diagnostics
jbachorik May 29, 2026
32 changes: 31 additions & 1 deletion AGENTS.md
@@ -425,6 +425,37 @@ Verify both:
```

### Structural Hash Rule
**HARD RULE**: Any time you add or change a field on `DFA.DFAState`, `DFA.DFATransition`, or any
`PatternInfo` subclass that affects bytecode generation, you MUST also update
`StructuralHash.java` to include that field in the hash. Failure to do so causes the level-2
structural cache to return a compiled class built for a different pattern, producing wrong runtime
results that are extremely hard to debug.

Checklist when touching `DFA.DFAState`, `DFA.DFATransition`, `NFA.NFAState`, or any `PatternInfo`:
- `DFAState` field added → add it to `computeDFATopologyHash()` state-loop body
- `DFATransition` field added → add it to `computeDFATopologyHash()` transition-loop body
- `NFAState` field added → add it to `NFA.contentHashCode()` state-loop body
- New NFA anchor predicate (`NFA.hasXxx()`) added → add the corresponding flag to `StructuralHash.compute()`
- `PatternInfo` subclass field added → add it to that class's `structuralHashCode()`

Example: `acceptanceAnchorConditions` and `entryGuard`, added after the anchor fix:
```java
// DFAState: per-state acceptance anchor conditions. Use ordinal-derived bitmasks for
// anchor EnumSets, not EnumSet.hashCode(), because Enum.hashCode() is identity-based.
hash = 31 * hash + anchorBitmask(state.acceptanceAnchorConditions);

// DFATransition: per-transition entry guard
hash = 31 * hash + anchorBitmask(entry.getValue().entryGuard);

private static int anchorBitmask(EnumSet<NFA.AnchorType> anchors) {
    int mask = 0;
    for (NFA.AnchorType anchor : anchors) {
        mask |= (1 << anchor.ordinal());
    }
    return mask;
}
```
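
The ordinal-bitmask idea above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `AnchorType` enum as a stand-in for `NFA.AnchorType` (the real constants may differ). The mask depends only on which constants are present, so it stays the same across JVM runs, unlike a hash built from `Enum.hashCode()`.

```java
import java.util.EnumSet;

final class AnchorMaskSketch {
    /** Hypothetical stand-in for NFA.AnchorType; the real enum may have other constants. */
    enum AnchorType { START, END, WORD_BOUNDARY }

    /** Folds an EnumSet into an int with one bit per ordinal. */
    static int anchorBitmask(EnumSet<AnchorType> anchors) {
        int mask = 0;
        for (AnchorType anchor : anchors) {
            mask |= (1 << anchor.ordinal());
        }
        return mask;
    }

    public static void main(String[] args) {
        // START has ordinal 0 and WORD_BOUNDARY has ordinal 2, so bits 0 and 2 are set.
        System.out.println(anchorBitmask(EnumSet.of(AnchorType.START, AnchorType.WORD_BOUNDARY)));
        // An empty set contributes 0 to the hash.
        System.out.println(anchorBitmask(EnumSet.noneOf(AnchorType.class)));
    }
}
```

This only works while the enum has at most 32 constants; a `long` mask would be the obvious extension beyond that.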

When creating `PatternInfo` subclasses, `structuralHashCode()` MUST include ALL fields affecting bytecode:
```java
public int structuralHashCode() {
@@ -700,7 +731,6 @@ Falling back to java.util.regex for pattern '<pattern>': <reason>
| Lookahead inside a quantified group | `(?:(?=\d)\d)+` | `lookahead inside quantified group` |
| Lookbehind followed by unbounded quantifier | `(?<=\d)[a-z]+` | `lookbehind followed by unbounded quantifier` |
| Alternation inside lookbehind | `(?<=a\|b)c` | `alternation inside lookbehind` |
| Lookbehind and lookahead used together | `(?<=\[)[^\]]+(?=\])` | `lookbehind and lookahead combined` |

> **Note:** Bug 1 (multiple backreferences to same group) only applies when the analyzer selects
> `OPTIMIZED_NFA_WITH_BACKREFS` or `VARIABLE_CAPTURE_BACKREF` strategy. Patterns routed through
@@ -0,0 +1,36 @@
---
spec_id: REQ-DataDog-java-reggie-27
source: github
source_ref: "DataDog/java-reggie#27"
title: "[bug] Multiple backreferences to same group produce false positives"
status: draft
clarity_score: null
created: 2026-05-08
implementing_session: null
implemented_pr: null
---

# [bug] Multiple backreferences to same group produce false positives

## Description
When a pattern references the same capturing group more than once (e.g. `(\w+)\s+\1\s+\1`), the engine returns incorrect results. The second backreference check is not enforced, causing false positives.

## Reproduction
```java
ReggieMatcher m = Reggie.compile("(\\w+)\\s+\\1\\s+\\1");
m.find("go go stop"); // returns true — WRONG, should be false
m.find("go go go"); // returns true — correct
```
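
For reference, the expected results can be confirmed against `java.util.regex`, which is also what the fallback delegates to. This sketch uses only the JDK; the class and helper names are illustrative and not part of Reggie.

```java
import java.util.regex.Pattern;

final class MultiBackrefOracle {
    /** Runs an unanchored search with the JDK engine. */
    static boolean jdkFind(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "(\\w+)\\s+\\1\\s+\\1";
        // The second \1 must also equal group 1, so "stop" cannot satisfy it.
        System.out.println(jdkFind(regex, "go go stop"));
        // All three words are identical, so both backreferences hold.
        System.out.println(jdkFind(regex, "go go go"));
    }
}
```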

## Root cause
Patterns selected by `OPTIMIZED_NFA_WITH_BACKREFS` and `VARIABLE_CAPTURE_BACKREF` strategies do not correctly validate the second occurrence of a backreference to the same group. The group capture state is not properly threaded through the second backref check.

## Current mitigation
`FallbackPatternDetector` detects this condition and falls back to `java.util.regex`. Patterns with 2+ references to the same group in these strategies are transparently delegated.

## Fix direction
- `NFABytecodeGenerator`: ensure group capture state persists across multiple backref checks for the same group number
- `VariableCaptureBackrefBytecodeGenerator`: validate all backreferences, not just the first

## Impact
High — incorrect match results (false positives) for multi-backref patterns.
@@ -0,0 +1,36 @@
---
spec_id: REQ-DataDog-java-reggie-35
source: github
source_ref: "DataDog/java-reggie#35"
title: "[pcre] Inline (?m) flag inside a group doesn't activate multiline mode mid-pattern"
status: draft
clarity_score: null
created: 2026-05-09
implementing_session: null
implemented_pr: null
---

# [pcre] Inline (?m) flag inside a group doesn't activate multiline mode mid-pattern

## Summary

When `(?m)` appears inside a capturing group (not at the start of the pattern), multiline mode is not activated for the `^` anchor that follows it within that group.

## Failing PCRE Test

- Pattern: `\n((?m)^b)`
- Input: `"a\nb\n"`
- Expected: matches with group 1 = `b`
- Actual: no match
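
As a cross-check, `java.util.regex` also scopes an inline flag to the enclosing group, so the JDK can serve as a reference for the expected result here. A minimal sketch using only the JDK (class and method names are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineMultilineOracle {
    /** Returns group 1 of the first match, or null when there is no match. */
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // (?m) is switched on inside the group, so ^ may match right after the '\n'.
        System.out.println(group1("\\n((?m)^b)", "a\nb\n"));
    }
}
```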

**Expected gain**: +1 PCRE conformance test (Category 5)

## Root Cause

Phase 1.2 fixed the anchor-optimization issue for patterns where `(?m)` appears globally (e.g., `(.*X|^B)`). However, when `(?m)` is embedded inline inside a sub-group, the flag-propagation logic doesn't update the anchor-matching behavior for `^` in that local scope.

## Implementation Notes

- Phase 1.2 fixed 4 of the 5 multiline-anchor tests; this is the remaining failure
- Difficulty: Medium
- Files likely involved: `RegexParser.java` (flag propagation), NFA anchor handling
@@ -0,0 +1,38 @@
---
spec_id: REQ-DataDog-java-reggie-29
source: github
source_ref: "DataDog/java-reggie#29"
title: "[bug] Unbounded quantifier after lookbehind always fails to match"
status: implementing
clarity_score: 85
created: 2026-05-10
implementing_session: impl-20260510-175457
implemented_pr: null
---

# [bug] Unbounded quantifier after lookbehind always fails to match

## Description
A lookbehind assertion followed by an unbounded quantifier (`+`, `*`, `{n,}`) always returns false, even for inputs that should match.

## Reproduction
```java
ReggieMatcher m = Reggie.compile("(?<=\\d)[a-z]+");
m.find("3abc"); // returns false — WRONG, should be true
m.find("abc"); // returns false — correct

// Bounded quantifier works:
Reggie.compile("(?<=\\d)[a-z]{1,4}").find("3abc"); // true — correct
```

## Root cause
In the `DFA_UNROLLED_WITH_ASSERTIONS` path, the lookbehind position is not correctly propagated as the starting position for the unbounded quantifier's loop. The loop starts at an incorrect offset and immediately fails.

## Current mitigation
`FallbackPatternDetector` detects a `ConcatNode` where a lookbehind `AssertionNode` is immediately followed by a `QuantifierNode` with `max == -1` and falls back to `java.util.regex`.

## Fix direction
After a lookbehind assertion succeeds, the following quantifier loop must start from the correct post-lookbehind position, not from the start of the assertion check.
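
The position handling can be illustrated with a hand-written matcher for `(?<=\d)[a-z]+`. This is only a sketch of the intended semantics, not the generated `DFA_UNROLLED_WITH_ASSERTIONS` code; the class and method names are hypothetical, and the JDK result is included for comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindLoopSketch {
    /** Hand-written find() for (?<=\d)[a-z]+; returns {start, end} or null. */
    static int[] find(String s) {
        for (int pos = 1; pos < s.length(); pos++) {
            char before = s.charAt(pos - 1);
            // The lookbehind inspects s[pos - 1] but consumes nothing.
            if (before < '0' || before > '9') continue;
            // The quantifier loop starts at pos, not at pos - 1 where the assertion looked.
            int end = pos;
            while (end < s.length() && s.charAt(end) >= 'a' && s.charAt(end) <= 'z') end++;
            if (end > pos) return new int[] {pos, end};
        }
        return null;
    }

    /** Reference result from java.util.regex. */
    static int[] jdkFind(String s) {
        Matcher m = Pattern.compile("(?<=\\d)[a-z]+").matcher(s);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }
}
```

For `"3abc"` both methods report the span from index 1 to index 4, and for `"abc"` both return `null`.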

## Impact
Medium — affects patterns common in tokenization and text extraction.
@@ -0,0 +1,36 @@
---
spec_id: REQ-DataDog-java-reggie-30
source: github
source_ref: "DataDog/java-reggie#30"
title: "[bug] Only first alternative in lookbehind alternation is checked"
status: draft
clarity_score: null
created: 2026-05-10
implementing_session: null
implemented_pr: null
---

# [bug] Only first alternative in lookbehind alternation is checked

## Description
When a lookbehind assertion contains an alternation (`(?<=a|b)c`), only the first alternative is considered. Subsequent alternatives are silently ignored, causing false negatives.

## Reproduction
```java
ReggieMatcher m = Reggie.compile("(?<=a|b)c");
m.find("ac"); // returns true — correct
m.find("bc"); // returns false — WRONG, should be true
m.find("xc"); // returns false — correct
```

## Root cause
The `OPTIMIZED_NFA_WITH_LOOKAROUND` strategy processes lookbehind alternations but only evaluates the first branch. When the first alternative fails, the NFA does not try remaining alternatives in the lookbehind.

## Current mitigation
`FallbackPatternDetector` detects an `AssertionNode(lookbehind)` whose `subPattern` directly contains an `AlternationNode`, and falls back to `java.util.regex`.

## Fix direction
In `NFABytecodeGenerator` lookbehind handling: after the lookbehind subpattern fails for one alternative, iterate over all remaining alternatives rather than short-circuiting on the first failure.
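
The intended control flow can be sketched with a hand-written matcher for `(?<=a|b)c`. This is an illustration of the semantics only, under the assumption that every alternative is a literal; it is not the `NFABytecodeGenerator` code, and all names are hypothetical.

```java
final class LookbehindAltSketch {
    /** True if any literal alternative ends exactly at pos. */
    static boolean lookbehindAny(String s, int pos, String... alternatives) {
        for (String alt : alternatives) {
            // A failed alternative falls through to the next one instead of failing the assertion.
            if (pos >= alt.length() && s.startsWith(alt, pos - alt.length())) {
                return true;
            }
        }
        return false;
    }

    /** Hand-written find() for (?<=a|b)c. */
    static boolean find(String s) {
        for (int pos = 0; pos < s.length(); pos++) {
            if (s.charAt(pos) == 'c' && lookbehindAny(s, pos, "a", "b")) {
                return true;
            }
        }
        return false;
    }
}
```

With this loop, `"ac"` and `"bc"` both match and `"xc"` does not, which is the behaviour listed in the reproduction.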

## Impact
Medium — false negatives for patterns using lookbehind alternatives.
@@ -0,0 +1,37 @@
---
spec_id: REQ-DataDog-java-reggie-36
source: github
source_ref: "DataDog/java-reggie#36"
title: "[pcre] Lookahead combined with nested alternation produces wrong group captures"
status: implemented
clarity_score: 72
created: 2026-05-11
implementing_session: impl-20260511-102846
implemented_pr: "https://github.com/DataDog/java-reggie/pull/59"
---

# [pcre] Lookahead combined with nested alternation produces wrong group captures

## Summary

Two PCRE tests involving lookahead assertions nested inside alternations or combined with digit-range character classes produce incorrect group captures.

## Failing PCRE Tests

1. Pattern `(\.\d\d((?=0)|\d(?=\d)))` on input `1.875000282`
   - Inner `(?=0)` / `\d(?=\d)` alternation inside a capturing group fails to record the correct group 2 value.

2. Pattern `(\.\d\d[1-9]?)\d+` on input `1.235`
   - Expected group 1 = `.23`, actual = `.235`
   - The `[1-9]?` optional class greedily consumes one character that should be left to `\d+`.
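
Both patterns are also valid `java.util.regex` syntax, so the JDK can be used as a reference when checking captures. The helper below is a sketch that uses only the JDK; treating the JDK result as equal to the PCRE expectation is an assumption, which the source confirms only for the second pattern (group 1 = `.23`).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadCaptureOracle {
    /** Returns the text of group g for the first match, or null when there is no match. */
    static String group(String regex, String input, int g) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(g) : null;
    }

    public static void main(String[] args) {
        // Test 1: inspect groups 1 and 2 of the lookahead-in-alternation pattern.
        String first = "(\\.\\d\\d((?=0)|\\d(?=\\d)))";
        System.out.println(group(first, "1.875000282", 1));
        System.out.println(group(first, "1.875000282", 2));
        // Test 2: [1-9]? must give its digit back so that \d+ can match.
        System.out.println(group("(\\.\\d\\d[1-9]?)\\d+", "1.235", 1));
    }
}
```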

**Expected gain**: +2 PCRE conformance tests (Category 6, remaining after Phase 2.1)

## Root Cause

These are backtracking/greedy edge cases in patterns where a lookahead sits inside an alternation within a capturing group. The NFA/DFA grouping boundary isn't preserved correctly during lookahead evaluation and the greedy quantifier does not backtrack into the optional class.

## Implementation Notes

- Difficulty: Medium
- Files likely involved: `NFABytecodeGenerator.java`, lookahead handling in `PatternAnalyzer.java`
85 changes: 85 additions & 0 deletions doc/plans/algorithmic-fuzz-tests-vs-jdk.md
@@ -0,0 +1,85 @@
# Algorithmic fuzz testing against JDK regex

## Motivation

The last three landed fixes (anchor placement, bounded quantifier upper bound,
SWAR multi-range filter) were all triggered by a user pattern that hit a
case the existing test suite didn't cover. In every case, the symptom was
"Reggie disagrees with `java.util.regex.Pattern` on a specific input/regex
pair." The bugs were not in subtle corner cases of obscure features —
they were in common shapes (`$X|Y`, `[0-9]{5}`, `[-_]?[0-9]{5,}`) that
just happened to fall outside the existing hand-written tests.

We need a generator-based test that constructs **syntactically valid
regexes algorithmically**, runs them against **algorithmically generated
inputs**, and asserts that Reggie and `java.util.regex.Pattern` agree on
the result. The point is not to fuzz the parser (random bytes) — it's to
**enumerate well-typed pattern shapes** and confirm Reggie matches JDK
semantics across them.

## Scope (what to enumerate)

A grammar-driven generator producing patterns over a small alphabet
(`a`, `b`, `c`, `0`, `1`, `-`, `_`), bounded in depth and complexity:

- **Atoms**: literal char, char class `[abc]` / `[a-z]` / `[^...]`, `.`
- **Quantifiers**: `?`, `*`, `+`, `{n}`, `{n,}`, `{n,m}` (greedy and lazy)
- **Concat / alternation**: 2–3 levels of nesting
- **Anchors**: `^`, `$`, `\A`, `\Z`, `\z` (placement at start, end, and
*interior* of branches — the third has been the bug-magnet)
- **Groups**: capturing `(...)` and non-capturing `(?:...)`
- **Backreferences**: `\1` once a `(...)` exists earlier in the pattern
- **Flags**: `(?i)`, `(?m)`, `(?s)`, both global and inline-scoped

For each pattern, enumerate inputs of length 0..16 over the same
alphabet plus a few "structural" inputs (newlines, repeated runs of
each alphabet char). Skip patterns the Reggie parser refuses, and skip
patterns for which the JDK throws `PatternSyntaxException`.
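
A grammar-driven generator of this kind might look like the sketch below. It is a simplified illustration, not the planned implementation: the class name, the atom and quantifier tables, and the depth parameter are all assumptions, and backreferences and flags are left out to keep it short. Seeding the `Random` makes every run reproducible.

```java
import java.util.Random;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class PatternGenSketch {
    private static final String[] ATOMS =
        {"a", "b", "c", "0", "1", "-", "_", ".", "[abc]", "[a-z]", "[^0-9]"};
    private static final String[] QUANTS =
        {"", "?", "*", "+", "{2}", "{1,}", "{1,3}", "*?", "+?"};
    private static final String[] ANCHORS = {"^", "$", "\\A", "\\Z", "\\z"};

    private final Random rnd;

    PatternGenSketch(long seed) {
        this.rnd = new Random(seed);
    }

    private String pick(String[] options) {
        return options[rnd.nextInt(options.length)];
    }

    /** An atom with an optional quantifier, e.g. "[a-z]{1,3}". */
    private String atom() {
        return pick(ATOMS) + pick(QUANTS);
    }

    /** Builds a pattern by recursive expansion; every sub-result is itself a valid pattern. */
    String generate(int depth) {
        if (depth <= 0) {
            return atom();
        }
        switch (rnd.nextInt(5)) {
            case 0: // concatenation
                return generate(depth - 1) + generate(depth - 1);
            case 1: // alternation in a non-capturing group
                return "(?:" + generate(depth - 1) + "|" + generate(depth - 1) + ")";
            case 2: // quantified capturing group
                return "(" + generate(depth - 1) + ")" + pick(QUANTS);
            case 3: // anchor; lands mid-branch when nested under concatenation
                return pick(ANCHORS) + generate(depth - 1);
            default:
                return atom();
        }
    }

    /** Filter for the "skip PatternSyntaxException" rule. */
    static boolean acceptedByJdk(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }
}
```

Because each production returns a complete pattern, any sub-tree can later be deleted or substituted without producing invalid syntax, which is the property the shrinker relies on.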

## Oracle

For each (pattern, input) pair compute:

1. JDK: `Pattern.compile(regex).matcher(input).matches()`, plus the iterated
   `Matcher.find()` sequence (collecting all non-overlapping matches and their group spans).
2. Reggie: `m.matches(input)`, the iterated `m.findMatch(input, start)`
sequence, and the group spans for each match.

Assert exact agreement on:

- whether `matches()` returns true,
- the list of match `start()`/`end()` pairs,
- per-match group `start(i)`/`end(i)` for `1 <= i <= groupCount`.
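
The JDK half of this oracle can be sketched as follows. The span layout (one `int[]` row per match) and the class name are assumptions made for illustration. The Reggie half would need to emit rows in the same shape from its own API, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JdkOracleSketch {
    /** Whole-input match, the JDK counterpart of matches(). */
    static boolean matchesWhole(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    /**
     * One row per match: [start, end, g1Start, g1End, ...].
     * Groups that did not participate are recorded as -1, -1.
     */
    static List<int[]> findAll(Pattern p, String input) {
        List<int[]> rows = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            int[] row = new int[2 * (m.groupCount() + 1)];
            for (int g = 0; g <= m.groupCount(); g++) {
                row[2 * g] = m.start(g);
                row[2 * g + 1] = m.end(g);
            }
            rows.add(row);
        }
        return rows;
    }

    /** Compares two span lists element by element. */
    static boolean sameSpans(List<int[]> a, List<int[]> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            if (!Arrays.equals(a.get(i), b.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```

For example, `findAll(Pattern.compile("(a)|b"), "ab")` yields two rows; in the second row group 1 did not participate, so its span is `-1, -1`.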

## Implementation notes

- A **shrinker** is the difference between "we have a 30-char failing
pattern" and "we have a 4-char failing pattern we can debug." Write
the generator with the property that any sub-tree of a failing
pattern is itself a valid pattern, so a shrink loop can delete
subtrees and re-check.
- Cache the compiled `ReggieMatcher` per pattern across inputs to keep
iteration time low; the codegen step dominates otherwise.
- Run the suite **offline** (not in `./gradlew check`) with a configurable
iteration count, plus a CI job that runs a smaller deterministic sample
on every PR.
- When a divergence is found, dump the pattern, the input, both results,
the strategy Reggie picked, and the generated bytecode path to a
fixture file. The fixture file becomes a regression test.
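
The shrink loop described in the first bullet can be sketched as below. This is a deliberately small illustration: the `Node` type is hypothetical, only top-level children of a concatenation are deleted, and the failure check is passed in as a predicate. A real shrinker would recurse into sub-trees and call the Reggie-vs-JDK comparison.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class ShrinkSketch {
    /** Hypothetical pattern tree: a leaf carries regex text, an inner node concatenates children. */
    static final class Node {
        final String text;     // non-null for leaves
        final List<Node> kids; // children of a concatenation node

        Node(String text) {
            this.text = text;
            this.kids = Collections.emptyList();
        }

        Node(List<Node> kids) {
            this.text = null;
            this.kids = kids;
        }

        String render() {
            if (text != null) {
                return text;
            }
            StringBuilder sb = new StringBuilder();
            for (Node kid : kids) {
                sb.append(kid.render());
            }
            return sb.toString();
        }
    }

    /** Greedily deletes one child at a time while the pattern still reproduces the failure. */
    static Node shrink(Node root, Predicate<String> stillFails) {
        boolean progress = true;
        while (progress && root.kids.size() > 1) {
            progress = false;
            for (int i = 0; i < root.kids.size(); i++) {
                List<Node> fewer = new ArrayList<>(root.kids);
                fewer.remove(i);
                Node candidate = new Node(fewer);
                if (stillFails.test(candidate.render())) {
                    root = candidate; // keep the smaller pattern and restart the scan
                    progress = true;
                    break;
                }
            }
        }
        return root;
    }
}
```

Shrinking the concatenation of `a`, `[0-9]{5}`, `b`, `$` with a predicate that fails whenever the text contains `{5}` reduces the pattern to `[0-9]{5}`.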

## Reuse

`reggie-integration-tests` already has infrastructure for comparing
Reggie against external oracles for PCRE/RE2 corpora — extend it with a
JDK-Pattern oracle and a generator module, rather than starting fresh.

## Out of scope (separate effort)

- Performance fuzzing — that's `reggie-benchmark`'s job.
- Round-trip parser fuzzing (random bytes) — different bug class.
- Cross-engine equivalence beyond JDK (RE2, PCRE) — already partially
covered by the existing integration-test corpora.

## Status

Not yet implemented. Tracked as task #14 in the current session.