DataDog · jbachorik · May 29, 2026 · May 26, 2026
@@ -425,6 +425,37 @@ Verify both:
 ```
 
 ### Structural Hash Rule
+**HARD RULE**: Any time you add or change a field on `DFA.DFAState`, `DFA.DFATransition`, or any
+`PatternInfo` subclass that affects bytecode generation, you MUST also update
+`StructuralHash.java` to include that field in the hash. Failure to do so causes the level-2
+structural cache to return a compiled class built for a different pattern, producing wrong runtime
+results that are extremely hard to debug.
+
+Checklist when touching `DFA.DFAState`, `DFA.DFATransition`, `NFA.NFAState`, or any `PatternInfo`:
+- `DFAState` field added → add it to `computeDFATopologyHash()` state-loop body
+- `DFATransition` field added → add it to `computeDFATopologyHash()` transition-loop body
+- `NFAState` field added → add it to `NFA.contentHashCode()` state-loop body
+- New NFA anchor predicate (`NFA.hasXxx()`) added → add the corresponding flag to `StructuralHash.compute()`
+- `PatternInfo` subclass field added → add it to that class's `structuralHashCode()`
+
+Example — `acceptanceAnchorConditions` and `entryGuard` added post-anchor fix:
+```java
+// DFAState: per-state acceptance anchor conditions. Use ordinal-derived bitmasks for
+// anchor EnumSets, not EnumSet.hashCode(), because Enum.hashCode() is identity-based.
+hash = 31 * hash + anchorBitmask(state.acceptanceAnchorConditions);
+
+// DFATransition: per-transition entry guard
+hash = 31 * hash + anchorBitmask(entry.getValue().entryGuard);
+
+private static int anchorBitmask(EnumSet<NFA.AnchorType> anchors) {
+    int mask = 0;
+    for (NFA.AnchorType anchor : anchors) {
+        mask |= (1 << anchor.ordinal());
+    }
+    return mask;
+}
+```
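A minimal, self-contained sketch of the bitmask approach, for readers who want to try it outside the code base. `AnchorType` below is a stand-in enum with hypothetical constants, not the real `NFA.AnchorType`. The helper mirrors the shape shown above and assumes the enum has at most 32 constants.

```java
import java.util.EnumSet;

class AnchorBitmaskDemo {
    // Hypothetical stand-in for NFA.AnchorType; the real constants may differ.
    enum AnchorType { START, END, WORD_BOUNDARY }

    // One bit per enum ordinal. Only valid while the enum has at most 32 constants,
    // because the mask is a 32-bit int.
    static int anchorBitmask(EnumSet<AnchorType> anchors) {
        int mask = 0;
        for (AnchorType anchor : anchors) {
            mask |= (1 << anchor.ordinal());
        }
        return mask;
    }

    public static void main(String[] args) {
        // START has ordinal 0 (bit value 1), END has ordinal 1 (bit value 2).
        System.out.println(anchorBitmask(EnumSet.of(AnchorType.START, AnchorType.END))); // 3
        System.out.println(anchorBitmask(EnumSet.of(AnchorType.END)));                   // 2
        System.out.println(anchorBitmask(EnumSet.noneOf(AnchorType.class)));             // 0
    }
}
```

The mask depends only on declaration order, so it is the same in every JVM run. It does change if the enum constants are reordered, which is worth keeping in mind for any cache that outlives a build.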
+
 When creating `PatternInfo` subclasses, `structuralHashCode()` MUST include ALL fields affecting bytecode:
 ```java
 public int structuralHashCode() {
@@ -700,7 +731,6 @@ Falling back to java.util.regex for pattern '<pattern>': <reason>
 | Lookahead inside a quantified group | `(?:(?=\d)\d)+` | `lookahead inside quantified group` |
 | Lookbehind followed by unbounded quantifier | `(?<=\d)[a-z]+` | `lookbehind followed by unbounded quantifier` |
 | Alternation inside lookbehind | `(?<=a\|b)c` | `alternation inside lookbehind` |
-| Lookbehind and lookahead used together | `(?<=\[)[^\]]+(?=\])` | `lookbehind and lookahead combined` |
 
 > **Note:** Bug 1 (multiple backreferences to same group) only applies when the analyzer selects
 > `OPTIMIZED_NFA_WITH_BACKREFS` or `VARIABLE_CAPTURE_BACKREF` strategy. Patterns routed through

@@ -0,0 +1,36 @@
+---
+spec_id: REQ-DataDog-java-reggie-27
+source: github
+source_ref: "DataDog/java-reggie#27"
+title: "[bug] Multiple backreferences to same group produce false positives"
+status: draft
+clarity_score: null
+created: 2026-05-08
+implementing_session: null
+implemented_pr: null
+---
+
+# [bug] Multiple backreferences to same group produce false positives
+
+## Description
+When a pattern references the same capturing group more than once (e.g. `(\w+)\s+\1\s+\1`), the engine returns incorrect results. The second backreference check is not enforced, causing false positives.
+
+## Reproduction
+```java
+ReggieMatcher m = Reggie.compile("(\\w+)\\s+\\1\\s+\\1");
+m.find("go go stop"); // returns true — WRONG, should be false
+m.find("go go go");   // returns true — correct
+```
+
+## Root cause
+Patterns selected by `OPTIMIZED_NFA_WITH_BACKREFS` and `VARIABLE_CAPTURE_BACKREF` strategies do not correctly validate the second occurrence of a backreference to the same group. The group capture state is not properly threaded through the second backref check.
+
+## Current mitigation
+`FallbackPatternDetector` detects this condition and falls back to `java.util.regex`. Patterns with 2+ references to the same group in these strategies are transparently delegated.
+
+## Fix direction
+- `NFABytecodeGenerator`: ensure group capture state persists across multiple backref checks for the same group number
+- `VariableCaptureBackrefBytecodeGenerator`: validate all backreferences, not just the first
+
+## Impact
+High — incorrect match results (false positives) for multi-backref patterns.
@@ -0,0 +1,36 @@
+---
+spec_id: REQ-DataDog-java-reggie-35
+source: github
+source_ref: "DataDog/java-reggie#35"
+title: "[pcre] Inline (?m) flag inside a group doesn't activate multiline mode mid-pattern"
+status: draft
+clarity_score: null
+created: 2026-05-09
+implementing_session: null
+implemented_pr: null
+---
+
+# [pcre] Inline (?m) flag inside a group doesn't activate multiline mode mid-pattern
+
+## Summary
+
+When `(?m)` appears inside a capturing group (not at the start of the pattern), multiline mode is not activated for the `^` anchor that follows it within that group.
+
+## Failing PCRE Test
+
+- Pattern: `\n((?m)^b)`
+- Input: `"a\nb\n"`
+- Expected: matches with group 1 = `b`
+- Actual: no match
+
+**Expected gain**: +1 PCRE conformance test (Category 5)
+
+## Root Cause
+
+Phase 1.2 fixed the anchor-optimization issue for patterns where `(?m)` appears globally (e.g., `(.*X|^B)`). However, when `(?m)` is embedded inline inside a sub-group, the flag-propagation logic doesn't update the anchor-matching behavior for `^` in that local scope.
+
+## Implementation Notes
+
+- Phase 1.2 fixed 4 of the 5 multiline-anchor tests; this is the remaining failure
+- Difficulty: Medium
+- Files likely involved: `RegexParser.java` (flag propagation), NFA anchor handling
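For reference, the expected behaviour can be checked against `java.util.regex`, which scopes an inline flag to the enclosing group. The sketch below uses only the JDK; the class and helper names are illustrative and not part of Reggie.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineMultilineReference {
    // Returns group 1 of the first match, or null when the pattern does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a\nb\n";
        // With the inline flag, ^ matches right after the newline.
        System.out.println(firstGroup("\\n((?m)^b)", input)); // b
        // Without it, ^ only matches at the start of the input, so nothing matches.
        System.out.println(firstGroup("\\n(^b)", input));     // null
    }
}
```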
@@ -0,0 +1,38 @@
+---
+spec_id: REQ-DataDog-java-reggie-29
+source: github
+source_ref: "DataDog/java-reggie#29"
+title: "[bug] Unbounded quantifier after lookbehind always fails to match"
+status: implementing
+clarity_score: 85
+created: 2026-05-10
+implementing_session: impl-20260510-175457
+implemented_pr: null
+---
+
+# [bug] Unbounded quantifier after lookbehind always fails to match
+
+## Description
+A lookbehind assertion followed by an unbounded quantifier (`+`, `*`, `{n,}`) always returns false, even for inputs that should match.
+
+## Reproduction
+```java
+ReggieMatcher m = Reggie.compile("(?<=\\d)[a-z]+");
+m.find("3abc"); // returns false — WRONG, should be true
+m.find("abc");  // returns false — correct
+
+// Bounded quantifier works:
+Reggie.compile("(?<=\\d)[a-z]{1,4}").find("3abc"); // true — correct
+```
+
+## Root cause
+In the `DFA_UNROLLED_WITH_ASSERTIONS` path, the lookbehind position is not correctly propagated as the starting position for the unbounded quantifier's loop. The loop starts at an incorrect offset and immediately fails.
+
+## Current mitigation
+`FallbackPatternDetector` detects a `ConcatNode` where a lookbehind `AssertionNode` is immediately followed by a `QuantifierNode` with `max == -1` and falls back to `java.util.regex`.
+
+## Fix direction
+After a lookbehind assertion succeeds, the following quantifier loop must start from the correct post-lookbehind position, not from the start of the assertion check.
+
+## Impact
+Medium — affects patterns common in tokenization and text extraction.
@@ -0,0 +1,36 @@
+---
+spec_id: REQ-DataDog-java-reggie-30
+source: github
+source_ref: "DataDog/java-reggie#30"
+title: "[bug] Only first alternative in lookbehind alternation is checked"
+status: draft
+clarity_score: null
+created: 2026-05-10
+implementing_session: null
+implemented_pr: null
+---
+
+# [bug] Only first alternative in lookbehind alternation is checked
+
+## Description
+When a lookbehind assertion contains an alternation (`(?<=a|b)c`), only the first alternative is considered. Subsequent alternatives are silently ignored, causing false negatives.
+
+## Reproduction
+```java
+ReggieMatcher m = Reggie.compile("(?<=a|b)c");
+m.find("ac"); // returns true  — correct
+m.find("bc"); // returns false — WRONG, should be true
+m.find("xc"); // returns false — correct
+```
+
+## Root cause
+The `OPTIMIZED_NFA_WITH_LOOKAROUND` strategy processes lookbehind alternations but only evaluates the first branch. When the first alternative fails, the NFA does not try remaining alternatives in the lookbehind.
+
+## Current mitigation
+`FallbackPatternDetector` detects an `AssertionNode(lookbehind)` whose `subPattern` directly contains an `AlternationNode`, and falls back to `java.util.regex`.
+
+## Fix direction
+In `NFABytecodeGenerator` lookbehind handling: after the lookbehind subpattern fails for one alternative, iterate over all remaining alternatives rather than short-circuiting on the first failure.
+
+## Impact
+Medium — incorrect false negatives for patterns using lookbehind alternatives.
@@ -0,0 +1,37 @@
+---
+spec_id: REQ-DataDog-java-reggie-36
+source: github
+source_ref: "DataDog/java-reggie#36"
+title: "[pcre] Lookahead combined with nested alternation produces wrong group captures"
+status: implemented
+clarity_score: 72
+created: 2026-05-11
+implementing_session: impl-20260511-102846
+implemented_pr: "https://github.com/DataDog/java-reggie/pull/59"
+---
+
+# [pcre] Lookahead combined with nested alternation produces wrong group captures
+
+## Summary
+
+Two PCRE tests involving lookahead assertions nested inside alternations or combined with digit-range character classes produce incorrect group captures.
+
+## Failing PCRE Tests
+
+1. Pattern `(\.\d\d((?=0)|\d(?=\d)))` on input `1.875000282`
+   - Inner `(?=0)` / `\d(?=\d)` alternation inside a capturing group fails to record the correct group 2 value.
+
+2. Pattern `(\.\d\d[1-9]?)\d+` on input `1.235`
+   - Expected group 1 = `.23`, actual = `.235`
+   - The `[1-9]?` optional class greedily consumes one character that should be left to `\d+`.
+
+**Expected gain**: +2 PCRE conformance tests (Category 6, remaining after Phase 2.1)
+
+## Root Cause
+
+These are backtracking/greedy edge cases in patterns where a lookahead sits inside an alternation within a capturing group. The NFA/DFA grouping boundary isn't preserved correctly during lookahead evaluation and the greedy quantifier does not backtrack into the optional class.
+
+## Implementation Notes
+
+- Difficulty: Medium
+- Files likely involved: `NFABytecodeGenerator.java`, lookahead handling in `PatternAnalyzer.java`
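The expected captures for both tests can be reproduced with `java.util.regex`, which agrees with the expectations listed above for these two patterns. This is a JDK-only sketch; the class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadCaptureReference {
    // Returns groups 0..groupCount of the first match, or null when there is no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Test 1: the second alternative consumes '5' because the next char is a digit.
        System.out.println(Arrays.toString(
                groups("(\\.\\d\\d((?=0)|\\d(?=\\d)))", "1.875000282"))); // [.875, .875, 5]
        // Test 2: [1-9]? must give back '5' so that \d+ has something to match.
        System.out.println(Arrays.toString(
                groups("(\\.\\d\\d[1-9]?)\\d+", "1.235")));               // [.235, .23]
    }
}
```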
@@ -0,0 +1,85 @@
+# Algorithmic fuzz testing against JDK regex
+
+## Motivation
+
+The last three landed fixes (anchor placement, bounded quantifier upper bound,
+SWAR multi-range filter) were all triggered by a user pattern that hit a
+case the existing test suite didn't cover. In every case, the symptom was
+"Reggie disagrees with `java.util.regex.Pattern` on a specific input/regex
+pair." The bugs were not in subtle corner cases of obscure features —
+they were in common shapes (`$X|Y`, `[0-9]{5}`, `[-_]?[0-9]{5,}`) that
+just happened to fall outside the existing hand-written tests.
+
+We need a generator-based test that constructs **syntactically valid
+regexes algorithmically**, runs them against **algorithmically generated
+inputs**, and asserts that Reggie and `java.util.regex.Pattern` agree on
+the result. The point is not to fuzz the parser (random bytes) — it's to
+**enumerate well-typed pattern shapes** and confirm Reggie matches JDK
+semantics across them.
+
+## Scope (what to enumerate)
+
+A grammar-driven generator producing patterns over a small alphabet
+(`a`, `b`, `c`, `0`, `1`, `-`, `_`), bounded in depth and complexity:
+
+- **Atoms**: literal char, char class `[abc]` / `[a-z]` / `[^...]`, `.`
+- **Quantifiers**: `?`, `*`, `+`, `{n}`, `{n,}`, `{n,m}` (greedy and lazy)
+- **Concat / alternation**: 2–3 levels of nesting
+- **Anchors**: `^`, `$`, `\A`, `\Z`, `\z` (placement at start, end, and
+  *interior* of branches — the third has been the bug-magnet)
+- **Groups**: capturing `(...)` and non-capturing `(?:...)`
+- **Backreferences**: `\1` once a `(...)` exists earlier in the pattern
+- **Flags**: `(?i)`, `(?m)`, `(?s)`, both global and inline-scoped
+
+For each pattern, enumerate inputs of length 0..16 over the same
+alphabet plus a few "structural" inputs (newlines, repeated runs of
+each alphabet char). Skip patterns that Reggie's parser refuses, and skip
+patterns for which the JDK throws `PatternSyntaxException`.
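One possible shape for such a generator is sketched below. It covers only a subset of the scope (atoms, quantifiers, concatenation, alternation, capturing groups, anchors) and leaves out backreferences and flags. The class name, the option tables, and the depth scheme are all assumptions made for illustration. Only the JDK is used, so the "parser refuses" check is represented by `Pattern.compile` alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternGeneratorSketch {
    private static final String[] ATOMS =
            {"a", "b", "c", "0", "1", "-", "_", ".", "[abc]", "[0-9]", "[^a]"};
    private static final String[] QUANTIFIERS =
            {"", "?", "*", "+", "{2}", "{1,}", "{1,3}", "+?", "*?"};
    private static final String[] ANCHORS = {"^", "$", "\\A", "\\Z", "\\z"};

    private final Random random;

    PatternGeneratorSketch(long seed) {
        this.random = new Random(seed); // seeded so a failing run can be replayed
    }

    private String pick(String[] options) {
        return options[random.nextInt(options.length)];
    }

    // Builds one expression; depth bounds how deeply groups and alternations nest.
    String expr(int depth) {
        if (depth == 0) {
            return pick(ATOMS) + pick(QUANTIFIERS);
        }
        switch (random.nextInt(5)) {
            case 0:  return expr(depth - 1) + expr(depth - 1);                       // concatenation
            case 1:  return "(?:" + expr(depth - 1) + "|" + expr(depth - 1) + ")";   // alternation
            case 2:  return "(" + expr(depth - 1) + ")" + pick(QUANTIFIERS);         // capturing group
            case 3:  return pick(ANCHORS) + expr(depth - 1);                         // leading anchor
            default: return expr(depth - 1) + pick(ANCHORS);                         // trailing anchor
        }
    }

    // Produces `count` patterns, skipping any that the JDK rejects.
    static List<String> generate(long seed, int count, int maxDepth) {
        PatternGeneratorSketch generator = new PatternGeneratorSketch(seed);
        List<String> patterns = new ArrayList<>();
        while (patterns.size() < count) {
            String candidate = generator.expr(generator.random.nextInt(maxDepth + 1));
            try {
                Pattern.compile(candidate);
                patterns.add(candidate);
            } catch (PatternSyntaxException e) {
                // Not a valid JDK pattern: skip it, as the scope above requires.
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        for (String pattern : generate(42L, 10, 3)) {
            System.out.println(pattern);
        }
    }
}
```

Because anchors are attached at every recursion level, they end up at the start, the end, and the interior of branches once sub-expressions are concatenated or alternated.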
+
+## Oracle
+
+For each (pattern, input) pair compute:
+
+1. JDK: `pattern.matcher(input).matches()` and the iterated
+   `Matcher.find()` sequence (collecting all non-overlapping matches and
+   their group spans).
+2. Reggie: `m.matches(input)`, the iterated `m.findMatch(input, start)`
+   sequence, and the group spans for each match.
+
+Assert exact agreement on:
+
+- whether `matches()` returns true,
+- the list of match `start()`/`end()` pairs,
+- per-match group `start(i)`/`end(i)` for `1 <= i <= groupCount`.
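The JDK half of the oracle could be collected roughly as follows. The flattened `int[]` span layout and all names are assumptions for illustration. The Reggie half is not shown because its exact API is not assumed here; it would need to produce the same layout so that the two lists can be compared directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JdkOracleSketch {
    // Whole-input match, the JDK counterpart of matches().
    static boolean fullMatch(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    // Each match is flattened to [start, end, g1Start, g1End, ...].
    // Groups that did not participate in the match report -1 for both ends.
    static List<int[]> findSpans(Pattern pattern, String input) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            int[] span = new int[2 * (m.groupCount() + 1)];
            for (int g = 0; g <= m.groupCount(); g++) {
                span[2 * g] = m.start(g);
                span[2 * g + 1] = m.end(g);
            }
            spans.add(span);
        }
        return spans;
    }

    // Element-wise comparison of two span lists.
    static boolean sameSpans(List<int[]> expected, List<int[]> actual) {
        if (expected.size() != actual.size()) {
            return false;
        }
        for (int i = 0; i < expected.size(); i++) {
            if (!Arrays.equals(expected.get(i), actual.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(a)|b");
        System.out.println(fullMatch(pattern, "ab")); // false
        for (int[] span : findSpans(pattern, "ab")) {
            System.out.println(Arrays.toString(span)); // [0, 1, 0, 1] then [1, 2, -1, -1]
        }
    }
}
```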
+
+## Implementation notes
+
+- A **shrinker** is the difference between "we have a 30-char failing
+  pattern" and "we have a 4-char failing pattern we can debug." Write
+  the generator with the property that any sub-tree of a failing
+  pattern is itself a valid pattern, so a shrink loop can delete
+  subtrees and re-check.
+- Cache the compiled `ReggieMatcher` per pattern across inputs to keep
+  iteration time low; the codegen step dominates otherwise.
+- Run the suite **offline** (not in `./gradlew check`) with a configurable
+  iteration count, plus a CI job that runs a smaller deterministic sample
+  on every PR.
+- When a divergence is found, dump the pattern, the input, both results,
+  the strategy Reggie picked, and the generated bytecode path to a
+  fixture file. The fixture file becomes a regression test.
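The shrink loop described in the first note could look something like the sketch below. `Node` is a hypothetical pattern tree, not an existing Reggie type, and the `stillFails` predicate stands in for "Reggie and the JDK still disagree on this pattern". Every candidate is a strictly smaller tree, so the loop terminates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

class ShrinkerSketch {
    // Hypothetical pattern tree. A leaf carries regex text; an inner node joins its
    // children with a separator ("" for concat, "|" for alternation) inside (?:...).
    static final class Node {
        final String text;        // non-null only for leaves
        final String separator;
        final List<Node> children;

        Node(String text) {
            this.text = text;
            this.separator = "";
            this.children = Collections.emptyList();
        }

        Node(String separator, List<Node> children) {
            this.text = null;
            this.separator = separator;
            this.children = children;
        }

        String render() {
            if (text != null) {
                return text;
            }
            StringBuilder sb = new StringBuilder("(?:");
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(separator);
                sb.append(children.get(i).render());
            }
            return sb.append(")").toString();
        }
    }

    // All trees obtainable by one shrink step; each has fewer nodes than the input.
    static List<Node> candidates(Node node) {
        List<Node> out = new ArrayList<>();
        if (node.text != null) {
            return out; // leaves cannot shrink further
        }
        for (int i = 0; i < node.children.size(); i++) {
            out.add(node.children.get(i));                 // hoist child i over the node
            if (node.children.size() > 1) {                // delete child i
                List<Node> fewer = new ArrayList<>(node.children);
                fewer.remove(i);
                out.add(new Node(node.separator, fewer));
            }
            for (Node smaller : candidates(node.children.get(i))) { // shrink inside child i
                List<Node> replaced = new ArrayList<>(node.children);
                replaced.set(i, smaller);
                out.add(new Node(node.separator, replaced));
            }
        }
        return out;
    }

    // Greedily accepts the first smaller tree that still fails, until none does.
    static Node shrink(Node failing, Predicate<String> stillFails) {
        Node current = failing;
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Node candidate : candidates(current)) {
                if (stillFails.test(candidate.render())) {
                    current = candidate;
                    progressed = true;
                    break;
                }
            }
        }
        return current;
    }

    // Example tree rendering as (?:a+(?:b|$c)[0-9]{5}).
    static Node demoTree() {
        Node alternation = new Node("|", Arrays.asList(new Node("b"), new Node("$c")));
        return new Node("", Arrays.asList(new Node("a+"), alternation, new Node("[0-9]{5}")));
    }

    public static void main(String[] args) {
        // Stand-in failure condition: any pattern containing the interior anchor "$c".
        Node minimal = shrink(demoTree(), rendered -> rendered.contains("$c"));
        System.out.println(minimal.render()); // $c
    }
}
```

In a real harness the predicate would recompile the candidate with both engines and rerun the oracle on the failing input, which is where the per-pattern matcher cache from the second note would pay off.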
+
+## Reuse
+
+`reggie-integration-tests` already has infrastructure for comparing
+Reggie against external oracles for PCRE/RE2 corpora — extend it with a
+JDK-Pattern oracle and a generator module, rather than starting fresh.
+
+## Out of scope (separate effort)
+
+- Performance fuzzing — that's `reggie-benchmark`'s job.
+- Round-trip parser fuzzing (random bytes) — different bug class.
+- Cross-engine equivalence beyond JDK (RE2, PCRE) — already partially
+  covered by the existing integration-test corpora.
+
+## Status
+
+Not yet implemented. Tracked as task #14 in the current session.