Improve runtime compatibility, capture extraction, and token-sequence execution #68
Merged
40 commits
95f71ec
fix: anchor-aware DFA construction
jbachorik 3fdeee5
fix: bounded quantifiers find() — upper-bound cap + SWAR multi-range …
jbachorik 791423c
test: algorithmic fuzz test cross-checking Reggie vs JDK Pattern
jbachorik c13e30c
test: shrinker for fuzz findings + triage doc
jbachorik 3615479
fix: counted quantifier with min=0 missing zero-reps ε-bypass
jbachorik 22f1890
fix: find() loop no longer short-circuits on patterns where only one …
jbachorik 218d487
fix: lazy quantifiers respect find-vs-matches mode + zero-width count…
jbachorik 1b70fcd
fix: fuzz findings — SWAR multi-literal bug, cross-alt backref fallba…
jbachorik e01a5ab
fix: anchor-condition dilution + matches(\Z) + nullable backref + laz…
jbachorik f7741d4
fix: fuzz findings — SPECIALIZED_QUANTIFIED_GROUP, GreedyCharClass, O…
jbachorik fc8110d
docs: fuzz triage, issue priority list, libretti, and anchor diag test
jbachorik 1b26625
fix: narrow optional-quantifier DFA conflict check to avoid false fal…
jbachorik 800fbba
feat: support atomic groups and quoted literals
jbachorik 51101c6
feat: add matchInto capture boundary API
jbachorik c662831
feat: fall back for oversized DFA state spaces
jbachorik 6de21bd
bench: add logs backend grok benchmark
jbachorik 1765624
feat: add allocation-free NFA matchInto
jbachorik 91ea690
feat: add recursive descent matchInto
jbachorik 259ccf3
fix: avoid recursive descent for delimited negated captures
jbachorik 2e6d423
feat: add table-driven DFA backend
jbachorik f1d7584
perf: skip impossible DFA table scan starts
jbachorik 1265585
feat: complete P2 parser compatibility
jbachorik a0bcf08
fix: support combined lookaround assertions
jbachorik f9b1923
chore: include method details for oversized bytecode fallback
jbachorik 32a7d25
feat: add capture projection options
jbachorik 4151297
feat: specialize access log grok matching
jbachorik 4351e5c
feat: add structural pattern categorizer
jbachorik 5ea1e7d
feat: classify reusable log pattern atoms
jbachorik 90da424
feat: add linear template planning
jbachorik 092fd5a
feat: add linear template runtime matcher
jbachorik fd90e24
feat: route named linear templates
jbachorik 31eec11
feat: handle access log templates generically
jbachorik 254c604
refactor: remove access log oracle routing
jbachorik a470d08
perf: reuse linear template optional scratch
jbachorik 6b9db9d
refactor: remove access log matcher oracle
jbachorik af8f55f
test: assert real grok patterns use linear templates
jbachorik 6082430
refactor: rename linear templates to token sequences
jbachorik a4cb218
test: harden log token sequence equivalence
jbachorik b98aef6
fix: harden token sequence capture semantics
jbachorik a0873c1
test: assert anchor diagnostics
jbachorik
36 changes: 36 additions & 0 deletions
...ogjava-reggie27-bug-multiple-backreferences-to-same-group-produce-false-posi.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| --- | ||
| spec_id: REQ-DataDog-java-reggie-27 | ||
| source: github | ||
| source_ref: "DataDog/java-reggie#27" | ||
| title: "[bug] Multiple backreferences to same group produce false positives" | ||
| status: draft | ||
| clarity_score: null | ||
| created: 2026-05-08 | ||
| implementing_session: null | ||
| implemented_pr: null | ||
| --- | ||
|
|
||
| # [bug] Multiple backreferences to same group produce false positives | ||
|
|
||
| ## Description | ||
| When a pattern references the same capturing group more than once (e.g. `(\w+)\s+\1\s+\1`), the engine returns incorrect results. The second backreference check is not enforced, causing false positives. | ||
|
|
||
| ## Reproduction | ||
| ```java | ||
| ReggieMatcher m = Reggie.compile("(\\w+)\\s+\\1\\s+\\1"); | ||
| m.find("go go stop"); // returns true — WRONG, should be false | ||
| m.find("go go go"); // returns true — correct | ||
| ``` | ||
|
|
||
| ## Root cause | ||
| Patterns selected by `OPTIMIZED_NFA_WITH_BACKREFS` and `VARIABLE_CAPTURE_BACKREF` strategies do not correctly validate the second occurrence of a backreference to the same group. The group capture state is not properly threaded through the second backref check. | ||
|
|
||
| ## Current mitigation | ||
| `FallbackPatternDetector` detects this condition and falls back to `java.util.regex`. Patterns with 2+ references to the same group in these strategies are transparently delegated. | ||
|
|
||
| ## Fix direction | ||
| - `NFABytecodeGenerator`: ensure group capture state persists across multiple backref checks for the same group number | ||
| - `VariableCaptureBackrefBytecodeGenerator`: validate all backreferences, not just the first | ||
|
|
||
| ## Impact | ||
| High — incorrect match results (false positives) for multi-backref patterns. |
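
For reference, the expected results in the reproduction can be checked against `java.util.regex`, the engine the mitigation delegates to. This is a minimal sketch that uses only the JDK; the class name `MultiBackrefReference` is hypothetical and nothing here calls Reggie itself.

```java
import java.util.regex.Pattern;

// Reference behaviour from java.util.regex for the pattern in the reproduction.
final class MultiBackrefReference {
    // Both \1 occurrences must repeat whatever group 1 captured.
    static final Pattern P = Pattern.compile("(\\w+)\\s+\\1\\s+\\1");

    static boolean find(String input) {
        return P.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(find("go go stop")); // false: the third word differs
        System.out.println(find("go go go"));   // true
    }
}
```

A regression test for the eventual fix could assert that Reggie returns the same two booleans without going through the fallback.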
36 changes: 36 additions & 0 deletions
...dogjava-reggie35-pcre-inline-m-flag-inside-a-group-doesnt-activate-multiline.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| --- | ||
| spec_id: REQ-DataDog-java-reggie-35 | ||
| source: github | ||
| source_ref: "DataDog/java-reggie#35" | ||
| title: "[pcre] Inline (?m) flag inside a group doesn't activate multiline mode mid-pattern" | ||
| status: draft | ||
| clarity_score: null | ||
| created: 2026-05-09 | ||
| implementing_session: null | ||
| implemented_pr: null | ||
| --- | ||
|
|
||
| # [pcre] Inline (?m) flag inside a group doesn't activate multiline mode mid-pattern | ||
|
|
||
| ## Summary | ||
|
|
||
| When `(?m)` appears inside a capturing group (not at the start of the pattern), multiline mode is not activated for the `^` anchor that follows it in that sub-expression. | ||
|
|
||
| ## Failing PCRE Test | ||
|
|
||
| - Pattern: `\n((?m)^b)` | ||
| - Input: `"a\nb\n"` | ||
| - Expected: matches with group 1 = `b` | ||
| - Actual: no match | ||
|
|
||
| **Expected gain**: +1 PCRE conformance test (Category 5) | ||
|
|
||
| ## Root Cause | ||
|
|
||
| Phase 1.2 fixed the anchor-optimization issue for patterns where `(?m)` appears globally (e.g., `(.*X|^B)`). However, when `(?m)` is embedded inline inside a sub-group, the flag-propagation logic doesn't update the anchor-matching behavior for `^` in that local scope. | ||
|
|
||
| ## Implementation Notes | ||
|
|
||
| - Phase 1.2 fixed 4 of the 5 multiline-anchor tests; this is the remaining failure | ||
| - Difficulty: Medium | ||
| - Files likely involved: `RegexParser.java` (flag propagation), NFA anchor handling |
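
The spec's expectation comes from the PCRE test suite, but `java.util.regex` scopes an inline `(?m)` the same way for this pattern, so it can serve as a quick cross-check. A minimal sketch, assuming only the JDK; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Shows the effect of an inline (?m) placed inside a capturing group.
final class InlineMultilineReference {
    // Returns group 1 of the first match, or null when nothing matches.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a\nb\n";
        // With (?m) inside the group, ^ matches right after the newline.
        System.out.println(group1("\\n((?m)^b)", input)); // b
        // Without the flag, ^ only matches at the start of the input.
        System.out.println(group1("\\n(^b)", input));     // null
    }
}
```

The second call is the behaviour the failing test currently exhibits even with the flag present.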
38 changes: 38 additions & 0 deletions
...ogjava-reggie29-bug-unbounded-quantifier-after-lookbehind-always-fails-to-ma.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| --- | ||
| spec_id: REQ-DataDog-java-reggie-29 | ||
| source: github | ||
| source_ref: "DataDog/java-reggie#29" | ||
| title: "[bug] Unbounded quantifier after lookbehind always fails to match" | ||
| status: implementing | ||
| clarity_score: 85 | ||
| created: 2026-05-10 | ||
| implementing_session: impl-20260510-175457 | ||
| implemented_pr: null | ||
| --- | ||
|
|
||
| # [bug] Unbounded quantifier after lookbehind always fails to match | ||
|
|
||
| ## Description | ||
| A lookbehind assertion followed by an unbounded quantifier (`+`, `*`, `{n,}`) always returns false, even for inputs that should match. | ||
|
|
||
| ## Reproduction | ||
| ```java | ||
| ReggieMatcher m = Reggie.compile("(?<=\\d)[a-z]+"); | ||
| m.find("3abc"); // returns false — WRONG, should be true | ||
| m.find("abc"); // returns false — correct | ||
|
|
||
| // Bounded quantifier works: | ||
| Reggie.compile("(?<=\\d)[a-z]{1,4}").find("3abc"); // true — correct | ||
| ``` | ||
|
|
||
| ## Root cause | ||
| In the `DFA_UNROLLED_WITH_ASSERTIONS` path, the lookbehind position is not correctly propagated as the starting position for the unbounded quantifier's loop. The loop starts at an incorrect offset and immediately fails. | ||
|
|
||
| ## Current mitigation | ||
| `FallbackPatternDetector` detects a `ConcatNode` where a lookbehind `AssertionNode` is immediately followed by a `QuantifierNode` with `max == -1` and falls back to `java.util.regex`. | ||
|
|
||
| ## Fix direction | ||
| After a lookbehind assertion succeeds, the following quantifier loop must start from the correct post-lookbehind position, not from the start of the assertion check. | ||
|
|
||
| ## Impact | ||
| Medium — affects patterns common in tokenization and text extraction. |
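
The expected values in the reproduction match what `java.util.regex` produces, which is also what callers see today through the fallback. A minimal sketch using only the JDK; `LookbehindQuantifierReference` and `firstMatch` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Reference results for a lookbehind followed by bounded and unbounded quantifiers.
final class LookbehindQuantifierReference {
    // Returns the text of the first match, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Unbounded quantifier: the match starts right after the digit.
        System.out.println(firstMatch("(?<=\\d)[a-z]+", "3abc"));     // abc
        // No digit precedes any letter, so nothing matches.
        System.out.println(firstMatch("(?<=\\d)[a-z]+", "abc"));      // null
        // Bounded quantifier, same input and same result.
        System.out.println(firstMatch("(?<=\\d)[a-z]{1,4}", "3abc")); // abc
    }
}
```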
36 changes: 36 additions & 0 deletions
...ogjava-reggie30-bug-only-first-alternative-in-lookbehind-alternation-is-chec.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| --- | ||
| spec_id: REQ-DataDog-java-reggie-30 | ||
| source: github | ||
| source_ref: "DataDog/java-reggie#30" | ||
| title: "[bug] Only first alternative in lookbehind alternation is checked" | ||
| status: draft | ||
| clarity_score: null | ||
| created: 2026-05-10 | ||
| implementing_session: null | ||
| implemented_pr: null | ||
| --- | ||
|
|
||
| # [bug] Only first alternative in lookbehind alternation is checked | ||
|
|
||
| ## Description | ||
| When a lookbehind assertion contains an alternation (`(?<=a|b)c`), only the first alternative is considered. Subsequent alternatives are silently ignored, causing false negatives. | ||
|
|
||
| ## Reproduction | ||
| ```java | ||
| ReggieMatcher m = Reggie.compile("(?<=a|b)c"); | ||
| m.find("ac"); // returns true — correct | ||
| m.find("bc"); // returns false — WRONG, should be true | ||
| m.find("xc"); // returns false — correct | ||
| ``` | ||
|
|
||
| ## Root cause | ||
| The `OPTIMIZED_NFA_WITH_LOOKAROUND` strategy processes lookbehind alternations but only evaluates the first branch. When the first alternative fails, the NFA does not try remaining alternatives in the lookbehind. | ||
|
|
||
| ## Current mitigation | ||
| `FallbackPatternDetector` detects an `AssertionNode(lookbehind)` whose `subPattern` directly contains an `AlternationNode`, and falls back to `java.util.regex`. | ||
|
|
||
| ## Fix direction | ||
| In `NFABytecodeGenerator` lookbehind handling: after the lookbehind subpattern fails for one alternative, iterate over all remaining alternatives rather than short-circuiting on the first failure. | ||
|
|
||
| ## Impact | ||
| Medium — incorrect false negatives for patterns using lookbehind alternatives. |
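
As a cross-check of the three expected results, the same pattern can be run through `java.util.regex`, which evaluates every alternative inside the lookbehind. A minimal sketch using only the JDK; the class name is hypothetical.

```java
import java.util.regex.Pattern;

// Reference behaviour for an alternation inside a lookbehind.
final class LookbehindAlternationReference {
    static final Pattern P = Pattern.compile("(?<=a|b)c");

    static boolean find(String input) {
        return P.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(find("ac")); // true: first alternative
        System.out.println(find("bc")); // true: second alternative
        System.out.println(find("xc")); // false: neither alternative
    }
}
```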
37 changes: 37 additions & 0 deletions
...ogjava-reggie36-pcre-lookahead-combined-with-nested-alternation-produces-wro.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| --- | ||
| spec_id: REQ-DataDog-java-reggie-36 | ||
| source: github | ||
| source_ref: "DataDog/java-reggie#36" | ||
| title: "[pcre] Lookahead combined with nested alternation produces wrong group captures" | ||
| status: implemented | ||
| clarity_score: 72 | ||
| created: 2026-05-11 | ||
| implementing_session: impl-20260511-102846 | ||
| implemented_pr: "https://github.com/DataDog/java-reggie/pull/59" | ||
| --- | ||
|
|
||
| # [pcre] Lookahead combined with nested alternation produces wrong group captures | ||
|
|
||
| ## Summary | ||
|
|
||
| Two PCRE tests involving lookahead assertions nested inside alternations or combined with digit-range character classes produce incorrect group captures. | ||
|
|
||
| ## Failing PCRE Tests | ||
|
|
||
| 1. Pattern `(\.\d\d((?=0)|\d(?=\d)))` on input `1.875000282` | ||
| - Inner `(?=0)` / `\d(?=\d)` alternation inside a capturing group fails to record the correct group 2 value. | ||
|
|
||
| 2. Pattern `(\.\d\d[1-9]?)\d+` on input `1.235` | ||
| - Expected group 1 = `.23`, actual = `.235` | ||
| - The `[1-9]?` optional class greedily consumes one character that should be left to `\d+`. | ||
|
|
||
| **Expected gain**: +2 PCRE conformance tests (Category 6, remaining after Phase 2.1) | ||
|
|
||
| ## Root Cause | ||
|
|
||
| These are backtracking/greedy edge cases. In the first pattern a lookahead sits inside an alternation within a capturing group, and the NFA/DFA group boundary isn't preserved correctly during lookahead evaluation. In the second pattern the greedy optional class `[1-9]?` is not backtracked into when the following `\d+` still needs a character. | ||
|
|
||
| ## Implementation Notes | ||
|
|
||
| - Difficulty: Medium | ||
| - Files likely involved: `NFABytecodeGenerator.java`, lookahead handling in `PatternAnalyzer.java` |
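
The group values for both tests can be inspected with `java.util.regex`. The spec only states the expected group 1 for the second test, so the values printed for the first test are the JDK's results, not something taken from the spec. A minimal sketch using only the JDK; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Prints the captures java.util.regex records for the two failing patterns.
final class LookaheadCaptureReference {
    // Returns groups 0..groupCount of the first match, or null when nothing matches.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int g = 0; g < out.length; g++) {
            out[g] = m.group(g);
        }
        return out;
    }

    public static void main(String[] args) {
        // Test 1: (?=0) fails at '5', so group 2 comes from \d(?=\d).
        System.out.println(String.join(",",
            groups("(\\.\\d\\d((?=0)|\\d(?=\\d)))", "1.875000282"))); // .875,.875,5
        // Test 2: [1-9]? gives back '5' so that \d+ can match it.
        System.out.println(String.join(",",
            groups("(\\.\\d\\d[1-9]?)\\d+", "1.235")));               // .235,.23
    }
}
```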
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| # Algorithmic fuzz testing against JDK regex | ||
|
|
||
| ## Motivation | ||
|
|
||
| The last three landed fixes (anchor placement, bounded quantifier upper bound, | ||
| SWAR multi-range filter) were all triggered by a user pattern that hit a | ||
| case the existing test suite didn't cover. In every case, the symptom was | ||
| "Reggie disagrees with `java.util.regex.Pattern` on a specific input/regex | ||
| pair." The bugs were not in subtle corner cases of obscure features — | ||
| they were in common shapes (`$X|Y`, `[0-9]{5}`, `[-_]?[0-9]{5,}`) that | ||
| just happened to fall outside the existing hand-written tests. | ||
|
|
||
| We need a generator-based test that constructs **syntactically valid | ||
| regexes algorithmically**, runs them against **algorithmically generated | ||
| inputs**, and asserts that Reggie and `java.util.regex.Pattern` agree on | ||
| the result. The point is not to fuzz the parser (random bytes) — it's to | ||
| **enumerate well-typed pattern shapes** and confirm Reggie matches JDK | ||
| semantics across them. | ||
|
|
||
| ## Scope (what to enumerate) | ||
|
|
||
| A grammar-driven generator producing patterns over a small alphabet | ||
| (`a`, `b`, `c`, `0`, `1`, `-`, `_`), bounded in depth and complexity: | ||
|
|
||
| - **Atoms**: literal char, char class `[abc]` / `[a-z]` / `[^...]`, `.` | ||
| - **Quantifiers**: `?`, `*`, `+`, `{n}`, `{n,}`, `{n,m}` (greedy and lazy) | ||
| - **Concat / alternation**: 2–3 levels of nesting | ||
| - **Anchors**: `^`, `$`, `\A`, `\Z`, `\z` (placement at start, end, and | ||
| *interior* of branches — the third has been the bug-magnet) | ||
| - **Groups**: capturing `(...)` and non-capturing `(?:...)` | ||
| - **Backreferences**: `\1` once a `(...)` exists earlier in the pattern | ||
| - **Flags**: `(?i)`, `(?m)`, `(?s)`, both global and inline-scoped | ||
|
|
||
| For each pattern, enumerate inputs of length 0..16 over the same | ||
| alphabet plus a few "structural" inputs (newlines, repeated runs of | ||
| each alphabet char). Skip patterns the parser refuses, and skip patterns | ||
| for which the JDK throws `PatternSyntaxException`. | ||
|
|
||
| ## Oracle | ||
|
|
||
| For each (pattern, input) pair compute: | ||
|
|
||
| 1. JDK: `Pattern.compile(regex).matcher(input).matches()` and the iterated | ||
| `Matcher.find()` sequence (collecting all non-overlapping matches and their group spans). | ||
| 2. Reggie: `m.matches(input)`, the iterated `m.findMatch(input, start)` | ||
| sequence, and the group spans for each match. | ||
|
|
||
| Assert exact agreement on: | ||
|
|
||
| - whether `matches()` returns true, | ||
| - the list of match `start()`/`end()` pairs, | ||
| - per-match group `start(i)`/`end(i)` for `1 <= i <= groupCount`. | ||
|
|
||
| ## Implementation notes | ||
|
|
||
| - A **shrinker** is the difference between "we have a 30-char failing | ||
| pattern" and "we have a 4-char failing pattern we can debug." Write | ||
| the generator with the property that any sub-tree of a failing | ||
| pattern is itself a valid pattern, so a shrink loop can delete | ||
| subtrees and re-check. | ||
| - Cache the compiled `ReggieMatcher` per pattern across inputs to keep | ||
| iteration time low; the codegen step dominates otherwise. | ||
| - Run the suite **offline** (not in `./gradlew check`) with a configurable | ||
| iteration count, plus a CI job that runs a smaller deterministic sample | ||
| on every PR. | ||
| - When a divergence is found, dump the pattern, the input, both results, | ||
| the strategy Reggie picked, and the generated bytecode path to a | ||
| fixture file. The fixture file becomes a regression test. | ||
|
|
||
| ## Reuse | ||
|
|
||
| `reggie-integration-tests` already has infrastructure for comparing | ||
| Reggie against external oracles for PCRE/RE2 corpora — extend it with a | ||
| JDK-Pattern oracle and a generator module, rather than starting fresh. | ||
|
|
||
| ## Out of scope (separate effort) | ||
|
|
||
| - Performance fuzzing — that's `reggie-benchmark`'s job. | ||
| - Round-trip parser fuzzing (random bytes) — different bug class. | ||
| - Cross-engine equivalence beyond JDK (RE2, PCRE) — already partially | ||
| covered by the existing integration-test corpora. | ||
|
|
||
| ## Status | ||
|
|
||
| Not yet implemented. Tracked as task #14 in the current session. |
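
The JDK half of the oracle described in this document could be sketched as follows. This is an illustration under assumptions, not the planned implementation: the class `JdkOracle`, its method names, and the flat `int[]` span layout are hypothetical, and the Reggie side is omitted because the document does not spell out how group spans are read from `findMatch`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Collects what the JDK reports for one (pattern, input) pair.
final class JdkOracle {
    // Whole-input match, the counterpart of matches() on the Reggie side.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    // One int[] per non-overlapping match, laid out as
    // {start(0), end(0), start(1), end(1), ..., start(n), end(n)}.
    // A group that did not participate reports -1 for both offsets.
    static List<int[]> findAll(Pattern p, String input) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            int n = m.groupCount();
            int[] span = new int[2 * (n + 1)];
            for (int g = 0; g <= n; g++) {
                span[2 * g] = m.start(g);
                span[2 * g + 1] = m.end(g);
            }
            spans.add(span);
        }
        return spans;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(a)|b");
        for (int[] span : findAll(p, "ab")) {
            System.out.println(java.util.Arrays.toString(span));
        }
        // [0, 1, 0, 1]
        // [1, 2, -1, -1]
    }
}
```

Keeping the spans as plain integer arrays makes the comparison a simple element-wise equality check, and a divergence can be dumped to a fixture file without any custom serialization.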