diff --git a/AGENTS.md b/AGENTS.md index dd436ce..abc6124 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -425,6 +425,37 @@ Verify both: ``` ### Structural Hash Rule +**HARD RULE**: Any time you add or change a field on `DFA.DFAState`, `DFA.DFATransition`, or any +`PatternInfo` subclass that affects bytecode generation, you MUST also update +`StructuralHash.java` to include that field in the hash. Failure to do so causes the level-2 +structural cache to return a compiled class built for a different pattern, producing wrong runtime +results that are extremely hard to debug. + +Checklist when touching `DFA.DFAState`, `DFA.DFATransition`, `NFA.NFAState`, or any `PatternInfo`: +- `DFAState` field added → add it to `computeDFATopologyHash()` state-loop body +- `DFATransition` field added → add it to `computeDFATopologyHash()` transition-loop body +- `NFAState` field added → add it to `NFA.contentHashCode()` state-loop body +- New NFA anchor predicate (`NFA.hasXxx()`) added → add the corresponding flag to `StructuralHash.compute()` +- `PatternInfo` subclass field added → add it to that class's `structuralHashCode()` + +Example — `acceptanceAnchorConditions` and `entryGuard` added post-anchor fix: +```java +// DFAState: per-state acceptance anchor conditions. Use ordinal-derived bitmasks for +// anchor EnumSets, not EnumSet.hashCode(), because Enum.hashCode() is identity-based. +hash = 31 * hash + anchorBitmask(state.acceptanceAnchorConditions); + +// DFATransition: per-transition entry guard +hash = 31 * hash + anchorBitmask(entry.getValue().entryGuard); + +private static int anchorBitmask(EnumSet anchors) { + int mask = 0; + for (NFA.AnchorType anchor : anchors) { + mask |= (1 << anchor.ordinal()); + } + return mask; +} +``` + When creating `PatternInfo` subclasses, `structuralHashCode()` MUST include ALL fields affecting bytecode: ```java public int structuralHashCode() { @@ -700,7 +731,6 @@ Falling back to java.util.regex for pattern '': | Lookahead inside a quantified group | `(?:(?=\d)\d)+` | `lookahead inside quantified group` | | Lookbehind followed by unbounded quantifier | `(?<=\d)[a-z]+` | `lookbehind followed by unbounded quantifier` | | Alternation inside lookbehind | `(?<=a\|b)c` | `alternation inside lookbehind` | -| Lookbehind and lookahead used together | `(?<=\[)[^\]]+(?=\])` | `lookbehind and lookahead combined` | > **Note:** Bug 1 (multiple backreferences to same group) only applies when the analyzer selects > `OPTIMIZED_NFA_WITH_BACKREFS` or `VARIABLE_CAPTURE_BACKREF` strategy. Patterns routed through diff --git a/doc/libretti/2026-05-08-datadogjava-reggie27-bug-multiple-backreferences-to-same-group-produce-false-posi.md b/doc/libretti/2026-05-08-datadogjava-reggie27-bug-multiple-backreferences-to-same-group-produce-false-posi.md new file mode 100644 index 0000000..110deeb --- /dev/null +++ b/doc/libretti/2026-05-08-datadogjava-reggie27-bug-multiple-backreferences-to-same-group-produce-false-posi.md @@ -0,0 +1,36 @@ +--- +spec_id: REQ-DataDog-java-reggie-27 +source: github +source_ref: "DataDog/java-reggie#27" +title: "[bug] Multiple backreferences to same group produce false positives" +status: draft +clarity_score: null +created: 2026-05-08 +implementing_session: null +implemented_pr: null +--- + +# [bug] Multiple backreferences to same group produce false positives + +## Description +When a pattern references the same capturing group more than once (e.g. `(\w+)\s+\1\s+\1`), the engine returns incorrect results. The second backreference check is not enforced, causing false positives. + +## Reproduction +```java +ReggieMatcher m = Reggie.compile("(\\w+)\\s+\\1\\s+\\1"); +m.find("go go stop"); // returns true — WRONG, should be false +m.find("go go go"); // returns true — correct +``` + +## Root cause +Patterns selected by `OPTIMIZED_NFA_WITH_BACKREFS` and `VARIABLE_CAPTURE_BACKREF` strategies do not correctly validate the second occurrence of a backreference to the same group. The group capture state is not properly threaded through the second backref check. + +## Current mitigation +`FallbackPatternDetector` detects this condition and falls back to `java.util.regex`. Patterns with 2+ references to the same group in these strategies are transparently delegated. + +## Fix direction +- `NFABytecodeGenerator`: ensure group capture state persists across multiple backref checks for the same group number +- `VariableCaptureBackrefBytecodeGenerator`: validate all backreferences, not just the first + +## Impact +High — incorrect match results (false positives) for multi-backref patterns. diff --git a/doc/libretti/2026-05-09-datadogjava-reggie35-pcre-inline-m-flag-inside-a-group-doesnt-activate-multiline.md b/doc/libretti/2026-05-09-datadogjava-reggie35-pcre-inline-m-flag-inside-a-group-doesnt-activate-multiline.md new file mode 100644 index 0000000..1339a3a --- /dev/null +++ b/doc/libretti/2026-05-09-datadogjava-reggie35-pcre-inline-m-flag-inside-a-group-doesnt-activate-multiline.md @@ -0,0 +1,36 @@ +--- +spec_id: REQ-DataDog-java-reggie-35 +source: github +source_ref: "DataDog/java-reggie#35" +title: "[pcre] Inline (?m) flag inside a group doesn't activate multiline mode mid-pattern" +status: draft +clarity_score: null +created: 2026-05-09 +implementing_session: null +implemented_pr: null +--- + +# [pcre] Inline (?m) flag inside a group doesn't activate multiline mode mid-pattern + +## Summary + +When `(?m)` appears inside a capturing group (not at the start of the pattern), the multiline flag is not correctly activated for the surrounding `^` anchor used in that sub-expression. + +## Failing PCRE Test + +- Pattern: `\n((?m)^b)` +- Input: `"a\nb\n"` +- Expected: matches with group 1 = `b` +- Actual: no match + +**Expected gain**: +1 PCRE conformance test (Category 5) + +## Root Cause + +Phase 1.2 fixed the anchor-optimization issue for patterns where `(?m)` appears globally (e.g., `(.*X|^B)`). However, when `(?m)` is embedded inline inside a sub-group, the flag-propagation logic doesn't update the anchor-matching behavior for `^` in that local scope. + +## Implementation Notes + +- Phase 1.2 fixed 4 of the 5 multiline-anchor tests; this is the remaining failure +- Difficulty: Medium +- Files likely involved: `RegexParser.java` (flag propagation), NFA anchor handling diff --git a/doc/libretti/2026-05-10-datadogjava-reggie29-bug-unbounded-quantifier-after-lookbehind-always-fails-to-ma.md b/doc/libretti/2026-05-10-datadogjava-reggie29-bug-unbounded-quantifier-after-lookbehind-always-fails-to-ma.md new file mode 100644 index 0000000..a38d881 --- /dev/null +++ b/doc/libretti/2026-05-10-datadogjava-reggie29-bug-unbounded-quantifier-after-lookbehind-always-fails-to-ma.md @@ -0,0 +1,38 @@ +--- +spec_id: REQ-DataDog-java-reggie-29 +source: github +source_ref: "DataDog/java-reggie#29" +title: "[bug] Unbounded quantifier after lookbehind always fails to match" +status: implementing +clarity_score: 85 +created: 2026-05-10 +implementing_session: impl-20260510-175457 +implemented_pr: null +--- + +# [bug] Unbounded quantifier after lookbehind always fails to match + +## Description +A lookbehind assertion followed by an unbounded quantifier (`+`, `*`, `{n,}`) always returns false, even for inputs that should match. + +## Reproduction +```java +ReggieMatcher m = Reggie.compile("(?<=\\d)[a-z]+"); +m.find("3abc"); // returns false — WRONG, should be true +m.find("abc"); // returns false — correct + +// Bounded quantifier works: +Reggie.compile("(?<=\\d)[a-z]{1,4}").find("3abc"); // true — correct +``` + +## Root cause +In the `DFA_UNROLLED_WITH_ASSERTIONS` path, the lookbehind position is not correctly propagated as the starting position for the unbounded quantifier's loop. The loop starts at an incorrect offset and immediately fails. + +## Current mitigation +`FallbackPatternDetector` detects a `ConcatNode` where a lookbehind `AssertionNode` is immediately followed by a `QuantifierNode` with `max == -1` and falls back to `java.util.regex`. + +## Fix direction +After a lookbehind assertion succeeds, the following quantifier loop must start from the correct post-lookbehind position, not from the start of the assertion check. + +## Impact +Medium — affects patterns common in tokenization and text extraction. diff --git a/doc/libretti/2026-05-10-datadogjava-reggie30-bug-only-first-alternative-in-lookbehind-alternation-is-chec.md b/doc/libretti/2026-05-10-datadogjava-reggie30-bug-only-first-alternative-in-lookbehind-alternation-is-chec.md new file mode 100644 index 0000000..bfe1342 --- /dev/null +++ b/doc/libretti/2026-05-10-datadogjava-reggie30-bug-only-first-alternative-in-lookbehind-alternation-is-chec.md @@ -0,0 +1,36 @@ +--- +spec_id: REQ-DataDog-java-reggie-30 +source: github +source_ref: "DataDog/java-reggie#30" +title: "[bug] Only first alternative in lookbehind alternation is checked" +status: draft +clarity_score: null +created: 2026-05-10 +implementing_session: null +implemented_pr: null +--- + +# [bug] Only first alternative in lookbehind alternation is checked + +## Description +When a lookbehind assertion contains an alternation (`(?<=a|b)c`), only the first alternative is considered. Subsequent alternatives are silently ignored, causing false negatives. + +## Reproduction +```java +ReggieMatcher m = Reggie.compile("(?<=a|b)c"); +m.find("ac"); // returns true — correct +m.find("bc"); // returns false — WRONG, should be true +m.find("xc"); // returns false — correct +``` + +## Root cause +The `OPTIMIZED_NFA_WITH_LOOKAROUND` strategy processes lookbehind alternations but only evaluates the first branch. When the first alternative fails, the NFA does not try remaining alternatives in the lookbehind. + +## Current mitigation +`FallbackPatternDetector` detects an `AssertionNode(lookbehind)` whose `subPattern` directly contains an `AlternationNode`, and falls back to `java.util.regex`. + +## Fix direction +In `NFABytecodeGenerator` lookbehind handling: after the lookbehind subpattern fails for one alternative, iterate over all remaining alternatives rather than short-circuiting on the first failure. + +## Impact +Medium — incorrect false negatives for patterns using lookbehind alternatives. diff --git a/doc/libretti/2026-05-11-datadogjava-reggie36-pcre-lookahead-combined-with-nested-alternation-produces-wro.md b/doc/libretti/2026-05-11-datadogjava-reggie36-pcre-lookahead-combined-with-nested-alternation-produces-wro.md new file mode 100644 index 0000000..c53e232 --- /dev/null +++ b/doc/libretti/2026-05-11-datadogjava-reggie36-pcre-lookahead-combined-with-nested-alternation-produces-wro.md @@ -0,0 +1,37 @@ +--- +spec_id: REQ-DataDog-java-reggie-36 +source: github +source_ref: "DataDog/java-reggie#36" +title: "[pcre] Lookahead combined with nested alternation produces wrong group captures" +status: implemented +clarity_score: 72 +created: 2026-05-11 +implementing_session: impl-20260511-102846 +implemented_pr: "https://github.com/DataDog/java-reggie/pull/59" +--- + +# [pcre] Lookahead combined with nested alternation produces wrong group captures + +## Summary + +Two PCRE tests involving lookahead assertions nested inside alternations or combined with digit-range character classes produce incorrect group captures. + +## Failing PCRE Tests + +1. Pattern `(\.\d\d((?=0)|\d(?=\d)))` on input `1.875000282` + - Inner `(?=0)` / `\d(?=\d)` alternation inside a capturing group fails to record the correct group 2 value. + +2. Pattern `(\.\d\d[1-9]?)\d+` on input `1.235` + - Expected group 1 = `.23`, actual = `.235` + - The `[1-9]?` optional class greedily consumes one character that should be left to `\d+`. + +**Expected gain**: +2 PCRE conformance tests (Category 6, remaining after Phase 2.1) + +## Root Cause + +These are backtracking/greedy edge cases in patterns where a lookahead sits inside an alternation within a capturing group. The NFA/DFA grouping boundary isn't preserved correctly during lookahead evaluation and the greedy quantifier does not backtrack into the optional class. + +## Implementation Notes + +- Difficulty: Medium +- Files likely involved: `NFABytecodeGenerator.java`, lookahead handling in `PatternAnalyzer.java` diff --git a/doc/plans/algorithmic-fuzz-tests-vs-jdk.md b/doc/plans/algorithmic-fuzz-tests-vs-jdk.md new file mode 100644 index 0000000..097f023 --- /dev/null +++ b/doc/plans/algorithmic-fuzz-tests-vs-jdk.md @@ -0,0 +1,85 @@ +# Algorithmic fuzz testing against JDK regex + +## Motivation + +The last three landed fixes (anchor placement, bounded quantifier upper bound, +SWAR multi-range filter) were all triggered by a user pattern that hit a +case the existing test suite didn't cover. In every case, the symptom was +"Reggie disagrees with `java.util.regex.Pattern` on a specific input/regex +pair." The bugs were not in subtle corner cases of obscure features — +they were in common shapes (`$X|Y`, `[0-9]{5}`, `[-_]?[0-9]{5,}`) that +just happened to fall outside the existing hand-written tests. + +We need a generator-based test that constructs **syntactically valid +regexes algorithmically**, runs them against **algorithmically generated +inputs**, and asserts that Reggie and `java.util.regex.Pattern` agree on +the result. The point is not to fuzz the parser (random bytes) — it's to +**enumerate well-typed pattern shapes** and confirm Reggie matches JDK +semantics across them. + +## Scope (what to enumerate) + +A grammar-driven generator producing patterns over a small alphabet +(`a`, `b`, `c`, `0`, `1`, `-`, `_`), bounded in depth and complexity: + +- **Atoms**: literal char, char class `[abc]` / `[a-z]` / `[^...]`, `.` +- **Quantifiers**: `?`, `*`, `+`, `{n}`, `{n,}`, `{n,m}` (greedy and lazy) +- **Concat / alternation**: 2–3 levels of nesting +- **Anchors**: `^`, `$`, `\A`, `\Z`, `\z` (placement at start, end, and + *interior* of branches — the third has been the bug-magnet) +- **Groups**: capturing `(...)` and non-capturing `(?:...)` +- **Backreferences**: `\1` once a `(...)` exists earlier in the pattern +- **Flags**: `(?i)`, `(?m)`, `(?s)`, both global and inline-scoped + +For each pattern, enumerate inputs of length 0..16 over the same +alphabet plus a few "structural" inputs (newlines, repeated runs of +each alphabet char). Skip patterns the parser refuses; skip JDK +`PatternSyntaxException`. + +## Oracle + +For each (pattern, input) pair compute: + +1. JDK: `Pattern.matches(input)`, the iterated `Matcher.find()` sequence + (collecting all non-overlapping matches and their group spans). +2. Reggie: `m.matches(input)`, the iterated `m.findMatch(input, start)` + sequence, and the group spans for each match. + +Assert byte-for-byte agreement on: + +- whether `matches()` returns true, +- the list of match `start()`/`end()` pairs, +- per-match group `start(i)`/`end(i)` for `1 <= i <= groupCount`. + +## Implementation notes + +- A **shrinker** is the difference between "we have a 30-char failing + pattern" and "we have a 4-char failing pattern we can debug." Write + the generator with the property that any sub-tree of a failing + pattern is itself a valid pattern, so a shrink loop can delete + subtrees and re-check. +- Cache the compiled `ReggieMatcher` per pattern across inputs to keep + iteration time low; the codegen step dominates otherwise. +- Run the suite **offline** (not in `./gradlew check`) with a configurable + iteration count, plus a CI job that runs a smaller deterministic sample + on every PR. +- When a divergence is found, dump the pattern, the input, both results, + the strategy Reggie picked, and the generated bytecode path to a + fixture file. The fixture file becomes a regression test. + +## Reuse + +`reggie-integration-tests` already has infrastructure for comparing +Reggie against external oracles for PCRE/RE2 corpora — extend it with a +JDK-Pattern oracle and a generator module, rather than starting fresh. + +## Out of scope (separate effort) + +- Performance fuzzing — that's `reggie-benchmark`'s job. +- Round-trip parser fuzzing (random bytes) — different bug class. +- Cross-engine equivalence beyond JDK (RE2, PCRE) — already partially + covered by the existing integration-test corpora. + +## Status + +Not yet implemented. Tracked as task #14 in the current session. diff --git a/doc/plans/fuzz-findings-triage-EF-residual.md b/doc/plans/fuzz-findings-triage-EF-residual.md new file mode 100644 index 0000000..f7bddfc --- /dev/null +++ b/doc/plans/fuzz-findings-triage-EF-residual.md @@ -0,0 +1,143 @@ +# Cat E/F residual: anchor-condition dilution in DFA subset construction + +Triage performed against HEAD `218d487` (branch `fix/anchor-semantics`). + +## Reproducing divergences + +Algorithmic fuzz (seed `0xC0DEFEED_DEADBEEFL`, 500 patterns × 8 inputs) +yields 92 raw findings, 63 unique minimal repros. Of those, three +classes are excluded from this triage: + +- Alternation-order divergences (JDK leftmost-first vs Reggie's + leftmost-longest DFA) — accepted, requires tagged NFA or branch + priority tracking. Examples: `\A|[c]`, `(.\z)|`, `a{1}(?:1?[^a])$|b?`. +- Self-referencing backref Cat D — accepted (see prior triage doc). + +The actionable Cat E/F repros (verified to still diverge on `218d487`): + +| Pattern | Input | JDK find() | Reggie find() | matches() agree? | +|---|---|---|---|---| +| `\Z.[a]{1}\|_-` | `_a` | no match | `[0,2)` | no — Reggie `true` | +| `[ca]{2}(Z?^\|\Z)` | `cab` | no match | `[0,2)` | yes | +| `\Z[1]*\|1]` | `1` | `[1,1)` | `[0,1)` | no — Reggie `true` | +| `(1{0,}^\|]{2})` | `1` | `[0,0)` | `[0,1)` | yes | + +All four route to `DFA_UNROLLED` via `PatternDebugger.analyze()`. + +## Root cause + +`SubsetConstructor.buildDFA` (lines 89–152) computes a single +"transition guard" per partition slice by intersecting the source +anchor conditions of every NFA state that contributes a transition for +that slice (`mergeWeakest`, line 103). Intersection is the correct +"weakest precondition" merge for *one* logical path, but it is wrong +when contributors come from alternation branches with *different* +preconditions: + +``` +\Z.[a]{1}|_- on '_' + branch 1 ('.') srcCond = {STRING_END} // \Z propagated via ε-closure + branch 2 ('_') srcCond = {} // no anchor + transitionGuard = {STRING_END} ∩ {} = {} // anchor lost +``` + +After the transition fires unconditionally, both branches' NFA-state +continuations land in the post-state. Branch 1 was never "live" at +pos 0 (STRING_END false), but its post-state now appears alive, and +its accept condition can be discharged at a later position. The +pattern accepts incorrectly. + +`computeAcceptanceConditions` (lines 293–304) has the same shape: +when *any* accept state in the closure has empty condition, +acceptance is declared unconditional. For `(Z?^|\Z)` after the +outer `[ca]{2}` consumes two chars, the inner alternation's two +accept paths have disjoint conditions (`{START_MULTILINE}` vs +`{STRING_END}`) — the intersection logic again drops to `{}` and +the state is wrongly marked unconditionally accepting. + +`containsConsumeKillingAnchor` only prunes `END`/`STRING_END_ABSOLUTE`, +not `STRING_END`/`END_MULTILINE` (intentional — the latter pair admit +a trailing `\n`). The runtime entry guard for `STRING_END` *is* +implemented (`DFAUnrolledBytecodeGenerator.emitTransitionEntryGuard`, +line 3408) and would catch the bad transition, **but only if the +guard reaches it**. The dilution erases it before runtime ever sees +it. + +Working anchor-alternation patterns from `AnchorRegressionTest` +(`^[0-9]|q`, `$X|Y`, `$[^a-zA-Z0-9]|^[0-9]`) survive only by +coincidence: their alternation branches have disjoint *leading +consumer char-classes*, so no partition slice ever has contributors +from differing-anchor branches. + +## Fix options + +### A. Fallback safety net (low effort, conservative) + +Extend `FallbackPatternDetector` to detect alternations whose branches +have differing "leading positional anchor profiles": + +``` +leadingAnchorProfile(branch) = { anchors that must hold at the + entry position for any non-empty + match through the branch } +``` + +If two branches in an alternation have unequal profiles, route to +`JavaRegexFallbackMatcher` (delegates to `java.util.regex`). + +- Catches all four actionable repros. +- Also catches working patterns like `^[0-9]|q` and `$X|Y` — + perf regression on these (DFA → JDK fallback) but no + correctness change; `AnchorRegressionTest` still passes via the + JDK fallback. +- Pure AST walk, no DFA-construction state to thread. + +### B. Per-branch DFA state splitting (higher effort, correct) + +Track per-NFA-state source conditions through the DFA construction. +When a partition slice has contributors with differing source +conditions, instead of intersecting to a single guard, **split** the +post-state into one DFA state per source-condition equivalence +class. Each child state carries a distinct entry guard. + +This is structurally what the original anchor-aware DFA plan +anticipated ("paths with disjoint anchor conjunctions ... reports +back so the strategy selector can fall back to NFA"). The +"reports back" route is option A; the "doesn't fall back" route is +this option. + +- Catches the four bugs without falling back any patterns. +- Significant rework of `buildDFA`, `buildDFAWithAssertions`, and + the state-cache key shape (must include the source-condition + signature, not just the NFA-state set). +- DFA state count may grow but for these patterns the growth is + bounded (one state per branch's anchor profile). + +### C. Construction-time fallback signal (medium effort) + +When `buildDFA` detects a contributor-disagreement at a partition or +accept site, set a flag on the returned DFA. `RuntimeCompiler` / +`ReggieMatcherBytecodeGenerator` check the flag and route to +`JavaRegexFallbackMatcher`. Pure-AST detection from option A is +imprecise (catches working patterns); this detects the actual +condition that triggers the bug. + +- Most precise of the three. +- Requires plumbing the flag from `SubsetConstructor` through + `PatternAnalyzer.MatchingStrategyResult` and into the fallback + decision in two call sites. + +## Recommendation + +Option **C** if perf on `^[0-9]|q`-shaped patterns is important; +option **A** if speed of fix dominates. Option B is the correct +long-term direction but is a follow-up. + +## Verification + +- Add the four repros above to `AnchorRegressionTest` (asserting + span equivalence with `java.util.regex.Pattern`). +- Re-run `./gradlew :reggie-integration-tests:test` — the fuzz test + ceiling at 25% will not move (these four are 4/63 unique findings) + but the `AnchorRegressionTest` cases assert directly. +- Run `./gradlew check` to confirm no broader regression. diff --git a/doc/plans/fuzz-findings-triage.md b/doc/plans/fuzz-findings-triage.md new file mode 100644 index 0000000..ebaa21e --- /dev/null +++ b/doc/plans/fuzz-findings-triage.md @@ -0,0 +1,207 @@ +# Algorithmic fuzz: divergence triage + +Findings from the default seed of `AlgorithmicFuzzTest` +(seed `0xC0DEFEED_DEADBEEFL`, 500 patterns × 8 inputs). 161 raw +divergences, shrunk to 64 unique minimal repros. Each row is grouped +by what the smallest repro suggests is the underlying bug. + +## A. Lazy quantifier silently treated as greedy (highest priority) + +**Status:** investigated, fix deferred — requires architectural choice. + +### Root cause + +`RecursiveDescentBytecodeGenerator.visitQuantifier` ignores the +`!greedy` flag for quantifiers that are NOT inside a concat with a +following sibling. The comment at line ~1858 makes the choice +explicit: *"For non-greedy quantifiers, the preference for fewer +matches is handled by `generateConcatWithBacktracking` when followed +by more pattern elements. When standalone or at the end of a +pattern, always match max."* That's the bug — at the end of a +concat or at the root, the lazy preference is silently dropped. + +The shapes that hit this: + +- **Lazy at root**: `.??`, `a*?`, `(?:a)??` — visitQuantifier runs as + the outermost parser, returns greedy max. +- **Lazy at end of concat with no later sibling**: `X.??`, `X.*?` — + `visitConcat` checks `for (i = 0; i < children.size() - 1; i++)`, + so the trailing child is never considered for backtracking. + +### Why a one-line fix doesn't work + +A naive "match min only when `!greedy`" fix in `visitQuantifier` +makes `find()` correct but breaks `matches()`. `matches()` calls the +same parser and checks `result == length`. With lazy returning min, +patterns like `.??` against `"b"` would return matches=false; JDK +returns true (its engine backtracks to extend the lazy match until +the whole input is consumed). + +`concat-with-backtracking` already does the right thing — it starts +lazy at min and extends on failure — but only when the lazy +quantifier has siblings AFTER it. For trailing/root lazy, there's +no sibling to drive the failure-triggered extension. + +### Design options (none implemented yet) + +1. **Two-method emission**: emit `parse_X_greedy` (current) and + `parse_X_lazy` (min only) for each quantifier. `find()` calls the + lazy variant when the quantifier is at root or end-of-concat; + `matches()` always calls greedy. Cascades through recursive + parser dispatch. +2. **Anchored-matches transform**: at codegen time, model + `matches()` as `^pattern\z`. The trailing `\z` becomes a sibling + to the lazy quantifier, so concat-with-backtracking kicks in and + does the lazy expansion. Cleanest semantically; small surface + change; needs care with grouping. +3. **Instance flag**: add a `preferLazy` field to the matcher, + toggled by `find()`/`matches()`. `visitQuantifier` emits a + runtime branch on the field for lazy quantifiers. Smallest code + change; adds one branch per lazy quantifier match. + +Recommend **option 2** (anchored-matches transform) — it doesn't +duplicate methods or add runtime branches, and it brings Reggie's +matches() semantics closer to JDK's mental model. + +### Repros (unchanged from below; kept for context) + +The most common category. `*?`, `+?`, `??`, `{n,m}?` produce a +greedy match in Reggie while JDK produces the lazy one. + +| Pattern | Input | JDK find | Reggie find | +|---|---|---|---| +| `.??` | `b` | `[0,0)` | `[0,1)` | +| `.??` | `0` | `[0,0)` | `[0,1)` | +| `.??` | `_` | `[0,0)` | `[0,1)` | +| `(?:a)??` | `a` | `[0,0)` | `[0,1)` | +| `b??\|(){3}` | `b` | `[0,0)` | `[0,1)` | +| `\A.*?` | `a` | `[0,0)` | `[0,1)` | +| `\A.*?` | `1` | `[0,0)` | `[0,1)` | +| `[^0][0]*?` | `10` | `[0,1)` | `[0,2)` | +| `-?.{3,}?\|1{0}` | `aa0-` | `[0,0)` | `[0,3)` | + +`.??` on a single char is the canonical case. This should be the +**first** category to investigate — it likely cascades into a lot of +the broader findings. + +## B. Empty/zero-width match handling + +**Status:** mostly resolved by the Cat C fix (zero-reps ε-bypass). +After the fix, all of the following pass: `b{0,3}[c]{0}`, +`c{3}()|$`, `1?$`, `$c?`, `([^a]{2}\z|){1}`. The shared root cause +was the quantifier with `min=0` not exposing the empty-match path +through the surrounding alternation/concat — the same bug fixed in +category C. + +The remaining repro `1{0}(c{0}|]{4})|-?.{3}` against `_-0` is a +different problem: JDK picks the first alternation branch that can +match (NFA-style leftmost-first preference), Reggie picks whichever +branch matches the longest (DFA-style leftmost-longest). This is +alternation-order semantics — JDK/Perl prefers the textually first +alternative, but a classical DFA doesn't preserve that ordering. A +correct fix needs DFA ranking, a tagged-NFA execution, or branch +priority tracking through subset construction. Left as a known +divergence — separate effort. + +## C. Negated character class against single char + +| Pattern | Input | Kind | +|---|---|---| +| `[^c]c{0,3}` | `b` | matches: jdk=true, reggie=false | +| `[^c]c{0,3}` | `1` | find: jdk=true, reggie=false | +| `[^c]c{0,3}` | `1_c` | span differs | + +`[^c]c{0,3}` matches "b" (1 char not c, then 0 c's). Reggie says no. +Likely the same root cause as the {0,N} cases we already fixed in +STATELESS_LOOP, but in another codegen path now that the lower bound +plus zero-allowed upper bound combination changes the strategy. + +## D. Self-referencing backreference in alternation + +**Status:** accepted divergence. + +| Pattern | Input | Kind | +|---|---|---| +| `a\|(\1\1){1}` | `""` | matches differs, find differs | +| `[a]?(\1{2}){2}\|b` | `""` | find differs | +| `(.{3}a{1}_{3})?\1` | `""` | find differs | + +The pattern `\1` referring to group 1 from within group 1's own body +is semantically pathological — the group hasn't been captured yet, +so what does the backref match? PCRE/Perl say "the empty string" +(Reggie's behavior), JDK rejects the path entirely. Neither is +objectively wrong. + +The fuzz generator can produce these because its grammar allows a +backref to any open group, including the one currently being built. +Real-world patterns very rarely write `(\1...)` style self-loops; +the fix-vs-divergence tradeoff is not worth chasing in this pass. +Documenting as a known JDK divergence so the fuzz triage can stop +chasing variants of this shape. + +If we ever want JDK-strict matches() semantics here, the right fix +is in the parser: reject `\n` where `n` refers to a group that +encloses the backref site. The fuzz generator's `groupsInScope` +counter already excludes the currently-being-built group; adding +the same check in `RegexParser` would close the gap. + +## E. Quantified zero-width anchors + +**Status:** not yet fixed. + +| Pattern | Input | Kind | +|---|---|---| +| `\A{3,4}?(a\|[1a]+)` | (multiple) | span differs | +| `\A{3,6}\|...` | (multiple) | span differs | +| `\Z+b\|...` | (multiple) | span differs | + +Anchors quantified with `{n,m}` should still be zero-width — JDK +respects this. Reggie's Thompson builder calls +`buildCountedQuantifier`, which generates `n` separate copies of +the child fragment. For an anchor child, this creates `n` anchor +states in sequence, all of which must hold simultaneously at the +same position — fine semantically, but the surrounding NFA wiring +produces different ε-paths that diverge from JDK's behavior when +combined with alternation siblings. + +Fix idea: at parse or NFA-build time, collapse `Quantifier(anchor, +n, m, _)` to a single anchor node. Since the anchor is zero-width, +matching it 1 to ∞ times is equivalent to matching it once. + +## F. Anchor inside character class context / interior + +**Status:** mostly fixed by the +`hasStringStartAnchor → requiresStartAnchor` change in the +find()-loop optimization (DFAUnrolled + DFASwitch). + +After that fix `]\A|b` and `_{1}(\A)|_` find at the right positions. +The previous code special-cased `hasStringStartAnchor` (set whenever +*any* `\A` exists, regardless of whether all branches require it), +which made `find()` short-circuit at non-zero positions and so masked +the branches without `\A`. The corrected condition uses +`requiresStartAnchor`, which already treats `\A` as a barrier in the +all-paths analysis. + +The earlier `_{1}(\A)|_` finding is now passing. The `\Z.[a]{1}|_-` +case is the alternation-order preference problem from category B and +is tracked there. Remaining variants in this category involve `\Z` +in interior positions — extend the per-state anchor handling in +SubsetConstructor when needed. + +## Suggested execution order + +1. **A (lazy quantifiers)** — single root cause likely, big finding + reduction. Probably in `RegexParser` quantifier handling or in + the codegen for `*?` / `??`. +2. **C (negated char class + bounded zero)** — small targeted fix + per the STATELESS_LOOP pattern from `3fdeee5`. +3. **B (zero-width matches)** — needs careful review of `find()` + behaviour at empty positions in each strategy. +4. **D (self-ref backref)** — probably accept divergence (declare + the JDK behavior the expected one) and reject these patterns at + parse time, since they have no useful semantics. +5. **E + F (anchor placement)** — extend the anchor-aware DFA fix. + +Once A is closed, re-run the fuzz to re-count. Each fix should +shrink the divergence ceiling in `AlgorithmicFuzzTest`, eventually +tightening it from 25% down to ≈0%. diff --git a/doc/plans/issue-priority.md b/doc/plans/issue-priority.md new file mode 100644 index 0000000..8db5cef --- /dev/null +++ b/doc/plans/issue-priority.md @@ -0,0 +1,66 @@ +# Issue Priority Order + +Generated: 2026-05-08 + +## Group 1 — Correctness bugs with clear scope + +1. **[#27](https://github.com/DataDog/java-reggie/issues/27)** — Multiple backreferences to same group produce false positives + - Impact: HIGH (wrong results, not just false negatives) + - Fix localized to `NFABytecodeGenerator` and `VariableCaptureBackrefBytecodeGenerator` + +2. **[#35](https://github.com/DataDog/java-reggie/issues/35)** — Inline `(?m)` flag inside a group doesn't activate multiline mode mid-pattern + - Impact: +1 PCRE test, MEDIUM difficulty, narrow scope in `RegexParser.java` + +## Group 2 — Lookaround bugs (related code paths, tackle together) + +3. **[#30](https://github.com/DataDog/java-reggie/issues/30)** — Only first alternative in lookbehind alternation is checked + - Impact: MEDIUM, well-scoped NFA fix in `NFABytecodeGenerator` + +4. **[#29](https://github.com/DataDog/java-reggie/issues/29)** — Unbounded quantifier after lookbehind always fails + - Impact: MEDIUM, companion to #30, same `DFA_UNROLLED_WITH_ASSERTIONS` path + +5. **[#28](https://github.com/DataDog/java-reggie/issues/28)** — Lookahead inside quantified group produces wrong results + - Impact: HIGH, same strategy as #29/#30 + +6. **[#31](https://github.com/DataDog/java-reggie/issues/31)** — Combined lookbehind + lookahead (sandwich pattern) always fails + - Impact: HIGH — sandwich patterns are very common; fix after #28 and #29 are clean + +## Group 3 — PCRE conformance, medium difficulty + +7. **[#36](https://github.com/DataDog/java-reggie/issues/36)** — Lookahead combined with nested alternation produces wrong group captures + - Impact: +2 PCRE tests, MEDIUM difficulty + +8. **[#32](https://github.com/DataDog/java-reggie/issues/32)** — Scoped inline flags not supported (`(?i:...)`, `(?m-i:...)`) + - Impact: +4 PCRE tests, MEDIUM difficulty, parser flag push/pop + +9. **[#34](https://github.com/DataDog/java-reggie/issues/34)** — Nested groups with literal digits and backreferences produce wrong captures + - Impact: +2 PCRE tests, MEDIUM-HIGH difficulty, backref number parsing ambiguity + +## Group 4 — Feature pair with dependency + +10. **[#41](https://github.com/DataDog/java-reggie/issues/41)** — Atomic groups not supported (`(?>...)`) + - Impact: MEDIUM difficulty; prerequisite for #42 + +11. **[#42](https://github.com/DataDog/java-reggie/issues/42)** — Possessive quantifiers not supported (`*+`, `++`, `?+`, `{n,m}+`) + - Impact: trivial once #41 lands (desugar to atomic group) + +## Group 5 — Harder correctness fixes + +12. **[#37](https://github.com/DataDog/java-reggie/issues/37)** — Non-greedy (lazy) quantifiers inside capturing groups produce wrong captures + - Impact: +5 PCRE tests, HIGH difficulty — needs new `LazyQuantifierBytecodeGenerator` + +13. **[#33](https://github.com/DataDog/java-reggie/issues/33)** — Escaped-quote pattern group extraction incorrect in `DFA_UNROLLED_WITH_GROUPS` + - Impact: +1 PCRE test, HIGH difficulty — tagged DFA group tracking + +## Group 6 — Large features + +14. **[#40](https://github.com/DataDog/java-reggie/issues/40)** — Unicode property escapes not supported (`\p{L}`, `\p{N}`, etc.) + - Impact: 66+ PCRE tests filtered; MEDIUM-HIGH difficulty — large Unicode category tables + +## Group 7 — Architectural / very high difficulty + +15. **[#38](https://github.com/DataDog/java-reggie/issues/38)** — Recursive subroutine patterns (`(?1)`, `(?R)`) fail for palindrome-style checks + - Impact: +9 PCRE tests, HIGH difficulty — needs backtrackable recursion in `RecursiveDescentBytecodeGenerator` + +16. **[#39](https://github.com/DataDog/java-reggie/issues/39)** — Self-referencing backreferences not supported (e.g. `(a\1?){4}`) + - Impact: +3 PCRE tests, VERY HIGH difficulty — possible architectural change to `RecursiveDescentBytecodeGenerator` diff --git a/doc/plans/logs-backend.md b/doc/plans/logs-backend.md new file mode 100644 index 0000000..f62e07d --- /dev/null +++ b/doc/plans/logs-backend.md @@ -0,0 +1,102 @@ +# Reggie feature requirements — grok log-parsing adoption + +Context: We're evaluating Reggie 0.3.0-SNAPSHOT as a drop-in replacement for java.util.regex in the grok log-parsing pipeline (logs-processing service, ~16 µs/op on an access-log pattern). Benchmark: GrokModuleBenchmark.parse, baseline 16.4 µs/op (JDK). + +--- +## P0 — Compile blockers (throw today) + +1. Atomic groups (?>...) + +- Syntax: (?>X) — possessive non-capturing group +- Error: UnsupportedPatternException: Unsupported special group construct thrown from RegexParser.parseGroup() line 291 — no case '>' in the special-group dispatch +- Frequency: 5 core grok patterns (numberStr, numberExtStr, quotedString, unixPath, winPath); also emitted directly in match-rule bodies for optional fields ((?>%{_method} |), (?>HTTP\/...|)). Every grok job using %{number}, %{notSpace}, or path patterns hits this +- Semantics for a linear-time engine: (?>X) is purely a backtracking-prevention hint. A DFA/NFA with no backtracking can accept (?>X) and treat it as (?:X) with identical semantics — the fix is to add case '>': // atomic group, treat as non-capturing in parseGroup() and continue parsing as a standard non-capturing group + +--- +2. \Q...\E literal quoting + +- Syntax: \Qliteral text\E — quotes all metacharacters between \Q and \E +- Bug: Silent misparsing. parseEscape() default branch (line 527) converts \Q → LiteralNode('Q') and \E → LiteralNode('E'). No exception thrown; the compiled pattern silently matches the wrong text +- Frequency: Emitted by Pattern.quote() in GrokRuleBuilders.DATE for every literal separator in a date format string — e.g. dd/MMM/yyyy:HH:mm:ss Z produces [\d]{2}\Q/\E(?:Jan|...)\Q/\E[\d]{4,19}\Q:\E.... Any grok pipeline using %{date(...)} is affected +- Risk level: Higher than P0 in some ways — it doesn't fail loudly, it produces a pattern that compiles and runs but matches different strings than intended. Correctness hazard +- Fix: Add case 'Q': return parseQuotedLiteral() in parseEscape(), consuming characters until \E and emitting them as a concatenation of LiteralNodes + +--- +## P1 — Performance (after syntax is fixed, Reggie is 5.3% slower than JDK) + +Benchmark after stripping (?> → (?: to force 100% Reggie coverage: + +┌───────────────────┬─────────────┬────────┐ +│ Engine │ Score │ ±Error │ +├───────────────────┼─────────────┼────────┤ +│ JDK │ 15.63 µs/op │ ±0.30 │ +├───────────────────┼─────────────┼────────┤ +│ Reggie (stripped) │ 16.45 µs/op │ ±0.38 │ +└───────────────────┴─────────────┴────────┘ + +3. Allocation-free capture-group extraction + +- Problem: ReggieMatcher.match(String) allocates a MatchResult object on every successful match. The grok pipeline calls matcher.group(i) after every match to extract captures. JDK's Matcher is a stateful object reused across calls — group boundaries are stored as a int[] and group(i) extracts substrings lazily with no intermediate allocation +- Impact: The benchmark has 3 matching inputs out of 4. Each iteration allocates 3 MatchResult objects. With ~16 µs/op and multiple fields per match, allocation pressure is visible in GC and per-iteration cost +- Ask: Add a matchInto(String input, int[] groupStarts, int[] groupEnds) method (or a reusable Matcher-style object) that stores group boundaries in caller-provided arrays without allocating a result wrapper. The hot path in SafeGrokPattern.matches() would then call groupStarts[i] / input.substring(...) directly + +4. DFA state-budget / hybrid fallback for large alternation patterns + +- Problem: The grok IPv4/IPv6/hostname union pattern expands to ~2000 chars with deeply nested alternatives and quantifiers. For such patterns, a pure DFA can have exponentially more states than a backtracking NFA traverses in practice. JDK's NFA explores O(input × states visited) = near-linear for typical log inputs that match on the first alternative, while Reggie's DFA may precompute state sets for all possibilities upfront +- Observed: Even after fixing the above allocation issue, the DFA for this pattern is unlikely to out-perform JDK without a state-space budget +- Ask: A BacktrackConfig-style threshold (already present in the codebase) that, when the compiled DFA exceeds N states, falls back to an NFA simulation path. Or expose a Reggie.compile(pattern, Strategy.NFA) override for patterns known to have large DFAs + +--- +## P2 — Nice to have + +5. \Q...\E inside character classes [...] + +- parseCharClass() also has no \Q handler — the same silent misparsing occurs inside [...]. Low frequency in current patterns but a logical follow-on once bare \Q...\E is implemented + +6. Expose UnsupportedPatternException as a checked/public API contract + +- Currently UnsupportedPatternException extends ParseException which extends RuntimeException. Callers implementing a JDK-fallback supplier (like ReggieRegexPatternSupplier above) must catch Exception broadly. A public, stable exception type would make fallback code more precise + +--- +Key parser locations: +- parseGroup() line 291 — (?> catch-all throw +- parseEscape() line 527 — \Q/\E silent default +- Source: reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/parsing/RegexParser.java +--- + +## Final adoption architecture/status + +The logs-backend Grok access-log path now uses a native, deterministic Reggie route under +`CapturePolicy.NAMED_ONLY`: + +``` +regex AST -> PatternCategorizer -> LinearTokenSequencePlan -> LinearTokenSequenceMatcher +``` + +Important properties: + +- The route is structural: it categorizes reusable token atoms (IP/host, non-space fields, + quoted fields, integers, decimals, optional request fragments, delimiter captures, and trailing + bracketed logger capture). It does **not** route by exact pattern string or `grokN` capture names. +- `CapturePolicy.NAMED_ONLY` preserves original named group indexes so Grok can continue calling + `group(originalIndex)` after discovering names from the expanded regex. +- The old ad-hoc `AccessLogGrokMatcher` oracle has been removed; production routing now depends on + the generic categorizer/planner/runtime matcher only. +- The two real expanded logs-backend Grok patterns are committed as runtime test resources and have + regression tests proving they route through `LinearTokenSequenceMatcher`. +- JDK/Reggie named capture-boundary equivalence is tested for the real expanded patterns across + common/combined access logs, optional method/version fields, `-` byte count, empty quoted fields, + IPv6/hostname clients, and logger bracket decoys. + +Integrated benchmark after scratch-state reuse and oracle removal (`-wi 2 -i 3 -f 2 -prof gc`): + +| Engine | Score | Allocation | +|---|---:|---:| +| JDK regex | 16.210 ± 2.128 us/op | 7701.393 ± 154.805 B/op | +| Reggie native token sequence | 2.353 ± 0.161 us/op | 7682.979 ± 61.845 B/op | + +Target coverage for the logs-backend benchmark remains: + +``` +[Reggie] coverage: 2/2 native, 0/2 internal JDK fallback, 0/2 after atomic-strip, 0/2 supplier JDK fallback +``` diff --git a/doc/plans/sub-2x-perf-candidates.md b/doc/plans/sub-2x-perf-candidates.md new file mode 100644 index 0000000..0317571 --- /dev/null +++ b/doc/plans/sub-2x-perf-candidates.md @@ -0,0 +1,78 @@ +# Sub-2× perf candidates (Reggie vs JDK) + +Throughput sweep across the benchmark suite on `fix/anchor-semantics` +(post `3fdeee5`). Ratio = Reggie throughput / JDK throughput on the +same micro. Ratios are JMH single-fork, 1×warmup, 2×measurement +(noisy — error bars omitted, magnitudes meant for triage prioritization +not for shipping numbers). + +## Critical: Reggie 50-100× slower than JDK + +| Class | Bench | Reggie | JDK | Ratio | Notes | +|---|---|---|---|---|---| +| ComplexNFABenchmark | ComplexEmailLongMatch | 0.02 | 2.31 | **0.01×** | Strategy: `HYBRID_DFA_LOOKAHEAD` | +| ComplexNFABenchmark | ComplexEmailNoMatch | 0.20 | 12.37 | **0.02×** | same | +| ComplexNFABenchmark | ComplexEmailMatch | 0.08 | 4.31 | **0.02×** | same | + +Pattern: `(?=.{1,64}@)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` + +The HYBRID_DFA_LOOKAHEAD strategy degrades catastrophically on this +pattern. JDK's NFA backtracker handles it competently. Investigation +target: `PatternAnalyzer.analyzeAndRecommend()` may be picking the +hybrid path for a pattern shape where the recursive-descent or +optimized-NFA path would be faster. Profile the hybrid implementation +on this exact pattern. + +## Anchor-fix sites — slightly slower than JDK + +| Class | Bench | Reggie | JDK | Ratio | Notes | +|---|---|---|---|---|---| +| AnchorPlacementBenchmark | AtEndConcat_match | 9.32 | 11.79 | 0.79× | `xyz#$`, DFA_UNROLLED | +| AnchorPlacementBenchmark | AlternationMixed_endBranch | 29.85 | 33.55 | 0.89× | `^a\|z$` | +| AnchorPlacementBenchmark | AlternationMixed_startBranch | 29.74 | 33.17 | 0.90× | same | + +Per-state acceptance-condition checks emitted at every accept site by +`emitAcceptanceAnchorChecks`. The check is fast (one or two compares) +but adds branches in the hot path. Likely fixable by hoisting the +length-load out of the inner state loop, or by short-circuiting when +the EnumSet is known to be `{END}` only. + +## Lookbehind + backref — near-parity + +| Class | Bench | Reggie | JDK | Ratio | +|---|---|---|---|---| +| ComplexNFABenchmark | LookbehindBackrefMatch | 6.16 | 6.27 | 0.98× | +| ComplexNFABenchmark | LookbehindBackrefNoMatch | 4.84 | 4.83 | 1.00× | + +Pattern: `(?<=prefix)(\w+)\1(?=suffix)`. Reggie is at parity, which is +worth investigating because we'd expect a DFA-class engine to beat +JDK's backtracker. The hybrid path likely doesn't kick in for combined +lookbehind+backref patterns. + +## Borderline (1.0×–2.0×) — investigate after the above + +| Class | Bench | Ratio | Notes | +|---|---|---|---| +| AnchorPlacementBenchmark | UserPattern_leadingDigit | 1.04× | `$[^a-zA-Z0-9]\|^[0-9]` — only barely ahead of JDK | +| StringAnchorBenchmark | StringEnd_long_noNewline | 1.14× | `.*suffix\Z` matching no-newline input | +| BackreferenceBenchmark | SelfRefFourGroupsNoMatch | 1.19× | Self-referencing backreference, no match | +| AnchorPlacementBenchmark | TrailingZeros_noMatch | 1.30× | `\.?0+$` against non-matching input | +| GroupExtractionBenchmark | EmailGroups | 1.32× | Group-capture path | +| NamedGroupExtractionBenchmark | LogSpansByName | 1.35× | Named group, find by name | +| BackreferenceBenchmark | SelfRefFourGroupsMatch4 | 1.64× | Self-referencing backreference, match | + +Most of these are group-capture or backreference workloads — areas +where Reggie's overhead for setting up the tagged-DFA tag arrays +shows. Reasonable to leave as-is unless a consumer reports a hot path. + +## Verification methodology + +``` +./gradlew :reggie-benchmark:jmh \ + -Pjmh.args="(AnchorPlacement|StringAnchor|MatchOperation|FindOperation|Replacement|StateExplosion|BranchReset|ComplexNFA|NamedGroupExtraction|GroupExtraction|Backreference|Conditional|Assertion).*Benchmark -wi 1 -i 2 -f 1 -tu us -bm thrpt" +python3 /tmp/pairup_benchmarks.py reggie-benchmark/build/reports/jmh/results.json 2.0 +``` + +The pair-up script lives at `/tmp/pairup_benchmarks.py` in this +session; should be checked into `reggie-benchmark/scripts/` if +adopted. diff --git a/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/AnchorPlacementBenchmark.java b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/AnchorPlacementBenchmark.java new file mode 100644 index 0000000..d22c458 --- /dev/null +++ b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/AnchorPlacementBenchmark.java @@ -0,0 +1,201 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.benchmark; + +import com.datadoghq.reggie.Reggie; +import com.datadoghq.reggie.runtime.ReggieMatcher; +import java.util.concurrent.TimeUnit; +import java.util.regex.Pattern; +import org.openjdk.jmh.annotations.*; + +/** + * Benchmarks for anchor-placement patterns whose semantics were corrected by the anchor-aware DFA + * construction change. Each pattern is exercised through both {@link java.util.regex.Pattern}'s + * {@code matcher().find()} and Reggie's {@link ReggieMatcher#findMatch} on a non-trivial input so + * the cost of the per-state anchor checks is visible. + */ +@BenchmarkMode(Mode.Throughput) +@OutputTimeUnit(TimeUnit.MICROSECONDS) +@State(Scope.Benchmark) +@Warmup(iterations = 3, time = 1) +@Measurement(iterations = 5, time = 1) +@Fork(1) +public class AnchorPlacementBenchmark { + + /** A 128-char body with trailing zero run — exercises {@code \\.?0+$} matching at end. */ + private static final String TRAILING_ZEROS = "12345.6789012345678901234567890" + "0".repeat(97); + + /** A 128-char body whose first char is a digit — exercises {@code ^[0-9]} hot path. */ + private static final String LEADS_WITH_DIGIT = "1" + "abcdefghij".repeat(12) + "xyz"; + + /** A 128-char body that ends with a non-alphanumeric — the user's original pattern. */ + private static final String ENDS_WITH_SYMBOL = "abcdefghij".repeat(12) + "xyz#"; + + /** Plain identifier — exercises the "no match anywhere" path for the original user pattern. */ + private static final String PLAIN_IDENT = "abcdefghij".repeat(12) + "xyz0"; + + /** Long input used by the bare-{@code $} benchmark — find must scan to the end. */ + private static final String LONG_INPUT = "abcdefghij".repeat(50); + + // Patterns — kept exactly as the bug report and the consuming codebase use them. + private ReggieMatcher reggieTrailingZeros; + private Pattern jdkTrailingZeros; + private ReggieMatcher reggieUserPattern; + private Pattern jdkUserPattern; + private ReggieMatcher reggieStartDigit; + private Pattern jdkStartDigit; + private ReggieMatcher reggieBareDollar; + private Pattern jdkBareDollar; + private ReggieMatcher reggieAtEndConcat; + private Pattern jdkAtEndConcat; + private ReggieMatcher reggieAlternationMixed; + private Pattern jdkAlternationMixed; + + @Setup + public void setup() { + reggieTrailingZeros = Reggie.compile("\\.?0+$"); + jdkTrailingZeros = Pattern.compile("\\.?0+$"); + + reggieUserPattern = Reggie.compile("$[^a-zA-Z0-9]|^[0-9]"); + jdkUserPattern = Pattern.compile("$[^a-zA-Z0-9]|^[0-9]"); + + reggieStartDigit = Reggie.compile("^[0-9]"); + jdkStartDigit = Pattern.compile("^[0-9]"); + + reggieBareDollar = Reggie.compile("$"); + jdkBareDollar = Pattern.compile("$"); + + // End-anchor at end of concat: already worked pre-fix, ensure we did not regress. + reggieAtEndConcat = Reggie.compile("xyz#$"); + jdkAtEndConcat = Pattern.compile("xyz#$"); + + // Mixed alternation with start and end anchors in different branches. + reggieAlternationMixed = Reggie.compile("^a|z$"); + jdkAlternationMixed = Pattern.compile("^a|z$"); + } + + // ---- \.?0+$ -------------------------------------------------------------------------- + + @Benchmark + public boolean reggieTrailingZeros_match() { + return reggieTrailingZeros.findMatch(TRAILING_ZEROS) != null; + } + + @Benchmark + public boolean jdkTrailingZeros_match() { + return jdkTrailingZeros.matcher(TRAILING_ZEROS).find(); + } + + @Benchmark + public boolean reggieTrailingZeros_noMatch() { + return reggieTrailingZeros.findMatch(PLAIN_IDENT) != null; + } + + @Benchmark + public boolean jdkTrailingZeros_noMatch() { + return jdkTrailingZeros.matcher(PLAIN_IDENT).find(); + } + + // ---- $[^a-zA-Z0-9]|^[0-9] — the original bug report ---------------------------------- + + @Benchmark + public boolean reggieUserPattern_leadingDigit() { + return reggieUserPattern.findMatch(LEADS_WITH_DIGIT) != null; + } + + @Benchmark + public boolean jdkUserPattern_leadingDigit() { + return jdkUserPattern.matcher(LEADS_WITH_DIGIT).find(); + } + + @Benchmark + public boolean reggieUserPattern_noMatch() { + return reggieUserPattern.findMatch(ENDS_WITH_SYMBOL) != null; + } + + @Benchmark + public boolean jdkUserPattern_noMatch() { + return jdkUserPattern.matcher(ENDS_WITH_SYMBOL).find(); + } + + // ---- ^[0-9] alone (find on a long string) ---------------------------------------------- + + @Benchmark + public boolean reggieStartDigit_match() { + return reggieStartDigit.findMatch(LEADS_WITH_DIGIT) != null; + } + + @Benchmark + public boolean jdkStartDigit_match() { + return jdkStartDigit.matcher(LEADS_WITH_DIGIT).find(); + } + + @Benchmark + public boolean reggieStartDigit_noMatch() { + return reggieStartDigit.findMatch(LONG_INPUT) != null; + } + + @Benchmark + public boolean jdkStartDigit_noMatch() { + return jdkStartDigit.matcher(LONG_INPUT).find(); + } + + // ---- Bare $ — must scan to end-of-input ------------------------------------------------ + + @Benchmark + public boolean reggieBareDollar_match() { + return reggieBareDollar.findMatch(LONG_INPUT) != null; + } + + @Benchmark + public boolean jdkBareDollar_match() { + return jdkBareDollar.matcher(LONG_INPUT).find(); + } + + // ---- x$ — end-anchor at end of concat (pre-fix worked, regression check) -------------- + + @Benchmark + public boolean reggieAtEndConcat_match() { + return reggieAtEndConcat.findMatch(ENDS_WITH_SYMBOL) != null; + } + + @Benchmark + public boolean jdkAtEndConcat_match() { + return jdkAtEndConcat.matcher(ENDS_WITH_SYMBOL).find(); + } + + // ---- ^a|z$ — mixed alternation, both branches viable ----------------------------------- + + @Benchmark + public boolean reggieAlternationMixed_startBranch() { + return reggieAlternationMixed.findMatch("a" + LONG_INPUT) != null; + } + + @Benchmark + public boolean jdkAlternationMixed_startBranch() { + return jdkAlternationMixed.matcher("a" + LONG_INPUT).find(); + } + + @Benchmark + public boolean reggieAlternationMixed_endBranch() { + return reggieAlternationMixed.findMatch(LONG_INPUT + "z") != null; + } + + @Benchmark + public boolean jdkAlternationMixed_endBranch() { + return jdkAlternationMixed.matcher(LONG_INPUT + "z").find(); + } +} diff --git a/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/DFATableBenchmark.java b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/DFATableBenchmark.java new file mode 100644 index 0000000..2474616 --- /dev/null +++ b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/DFATableBenchmark.java @@ -0,0 +1,113 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.benchmark; + +import com.datadoghq.reggie.Reggie; +import com.datadoghq.reggie.runtime.ReggieMatcher; +import java.util.concurrent.TimeUnit; +import java.util.regex.Pattern; +import org.openjdk.jmh.annotations.Benchmark; +import org.openjdk.jmh.annotations.BenchmarkMode; +import org.openjdk.jmh.annotations.Fork; +import org.openjdk.jmh.annotations.Measurement; +import org.openjdk.jmh.annotations.Mode; +import org.openjdk.jmh.annotations.OutputTimeUnit; +import org.openjdk.jmh.annotations.Scope; +import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; +import org.openjdk.jmh.annotations.Warmup; +import org.openjdk.jmh.infra.Blackhole; + +/** Benchmarks large pure regular DFAs routed through the compact DFA_TABLE backend. */ +@BenchmarkMode(Mode.AverageTime) +@OutputTimeUnit(TimeUnit.NANOSECONDS) +@Warmup(iterations = 3, time = 1) +@Measurement(iterations = 5, time = 1) +@Fork(2) +@State(Scope.Thread) +public class DFATableBenchmark { + private static final String PATTERN = "(?:[a-z][0-9]){150}"; + + private Pattern jdk; + private ReggieMatcher reggie; + private String matchingInput; + private String searchInput; + private String nonMatchingInput; + private String noStartCharSearchInput; + private int[] bounds; + + @Setup + public void setup() { + jdk = Pattern.compile(PATTERN); + reggie = Reggie.compile(PATTERN); + matchingInput = "a1".repeat(150); + searchInput = "prefix-" + matchingInput + "-suffix"; + nonMatchingInput = matchingInput + "x"; + noStartCharSearchInput = "-".repeat(1024); + bounds = new int[2]; + } + + @Benchmark + public boolean jdkMatches() { + return jdk.matcher(matchingInput).matches(); + } + + @Benchmark + public boolean reggieMatches() { + return reggie.matches(matchingInput); + } + + @Benchmark + public boolean jdkFind(Blackhole bh) { + java.util.regex.Matcher matcher = jdk.matcher(searchInput); + boolean found = matcher.find(); + if (found) { + bh.consume(matcher.start()); + bh.consume(matcher.end()); + } + return found; + } + + @Benchmark + public boolean reggieFindBounds(Blackhole bh) { + boolean found = reggie.findBoundsFrom(searchInput, 0, bounds); + if (found) { + bh.consume(bounds[0]); + bh.consume(bounds[1]); + } + return found; + } + + @Benchmark + public boolean jdkFindNoStartChar() { + return jdk.matcher(noStartCharSearchInput).find(); + } + + @Benchmark + public boolean reggieFindBoundsNoStartChar() { + return reggie.findBoundsFrom(noStartCharSearchInput, 0, bounds); + } + + @Benchmark + public boolean jdkNonMatch() { + return jdk.matcher(nonMatchingInput).matches(); + } + + @Benchmark + public boolean reggieNonMatch() { + return reggie.matches(nonMatchingInput); + } +} diff --git a/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/LogsBackendGrokBenchmark.java b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/LogsBackendGrokBenchmark.java new file mode 100644 index 0000000..e557221 --- /dev/null +++ b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/LogsBackendGrokBenchmark.java @@ -0,0 +1,175 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.benchmark; + +import com.datadoghq.reggie.runtime.MatchResult; +import com.datadoghq.reggie.runtime.ReggieMatcher; +import com.datadoghq.reggie.runtime.RuntimeCompiler; +import java.util.concurrent.TimeUnit; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import org.openjdk.jmh.annotations.Benchmark; +import org.openjdk.jmh.annotations.BenchmarkMode; +import org.openjdk.jmh.annotations.Fork; +import org.openjdk.jmh.annotations.Measurement; +import org.openjdk.jmh.annotations.Mode; +import org.openjdk.jmh.annotations.OutputTimeUnit; +import org.openjdk.jmh.annotations.Scope; +import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; +import org.openjdk.jmh.annotations.Warmup; + +/** + * Grok-like access-log parsing benchmark for logs-backend adoption work. + * + *

The pattern intentionally includes constructs used by grok-generated regexes: + * + *

    + *
  • atomic groups, e.g. {@code (?>...)} + *
  • {@code \Q...\E} quoted separators from {@code Pattern.quote()} + *
  • capture extraction for multiple fields + *
+ */ +@BenchmarkMode(Mode.AverageTime) +@OutputTimeUnit(TimeUnit.MICROSECONDS) +@State(Scope.Thread) +@Warmup(iterations = 2, time = 1) +@Measurement(iterations = 3, time = 1) +@Fork(1) +public class LogsBackendGrokBenchmark { + + private static final String MONTHS = "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"; + + private static final String ACCESS_LOG_PATTERN = + "^" + + "((?>\\d{1,3}\\.){3}\\d{1,3})" + + " ([^ ]+)" + + " ([^ ]+)" + + " \\Q[\\E" + + "(\\d{2}\\Q/\\E" + + MONTHS + + "\\Q/\\E\\d{4}\\Q:\\E\\d{2}\\Q:\\E\\d{2}\\Q:\\E\\d{2} [+-]\\d{4})" + + "\\Q]\\E" + + " \\\"(GET|POST|PUT|DELETE|PATCH|HEAD|OPTIONS) ([^\\\" ]*) (HTTP/\\d\\.\\d)\\\"" + + " (\\d{3})" + + " (\\d+|-)" + + " \\\"([^\\\"]*)\\\"" + + " \\\"([^\\\"]*)\\\"" + + "$"; + + private static final String[] INPUTS = { + "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326 \"http://www.example.com/start.html\" \"Mozilla/4.08 [en] (Win98; I ;Nav)\"", + "192.168.1.10 - jane [07/Feb/2026:09:14:03 +0000] \"POST /api/v1/orders?id=123 HTTP/1.1\" 201 842 \"-\" \"curl/8.0.1\"", + "10.0.44.12 - svc [25/Dec/2025:23:59:59 +0100] \"DELETE /resource/abc-def HTTP/2.0\" 204 - \"https://example.org/ref\" \"logs-backend-benchmark\"", + "not an access log line" + }; + + private Pattern jdkPattern; + private ReggieMatcher reggieMatcher; + private int[] starts; + private int[] ends; + + @Setup + public void setup() { + RuntimeCompiler.clearCache(); + jdkPattern = Pattern.compile(ACCESS_LOG_PATTERN); + reggieMatcher = RuntimeCompiler.compile(ACCESS_LOG_PATTERN); + starts = new int[12]; + ends = new int[12]; + } + + @Benchmark + public int jdkParseAndExtract() { + int total = 0; + for (String input : INPUTS) { + Matcher matcher = jdkPattern.matcher(input); + if (matcher.matches()) { + for (int group = 1; group <= 11; group++) { + total += matcher.group(group).length(); + } + } + } + return total; + } + + @Benchmark + public int jdkParseBoundsOnly() { + int total = 0; + for (String input : INPUTS) { + Matcher matcher = jdkPattern.matcher(input); + if (matcher.matches()) { + for (int group = 1; group <= 11; group++) { + total += matcher.start(group) + matcher.end(group); + } + } + } + return total; + } + + @Benchmark + public int reggieMatchResultParseAndExtract() { + int total = 0; + for (String input : INPUTS) { + MatchResult match = reggieMatcher.match(input); + if (match != null) { + for (int group = 1; group <= 11; group++) { + total += match.group(group).length(); + } + } + } + return total; + } + + @Benchmark + public int reggieMatchResultBoundsOnly() { + int total = 0; + for (String input : INPUTS) { + MatchResult match = reggieMatcher.match(input); + if (match != null) { + for (int group = 1; group <= 11; group++) { + total += match.start(group) + match.end(group); + } + } + } + return total; + } + + @Benchmark + public int reggieMatchIntoParseAndExtract() { + int total = 0; + for (String input : INPUTS) { + if (reggieMatcher.matchInto(input, starts, ends)) { + for (int group = 1; group <= 11; group++) { + total += input.substring(starts[group], ends[group]).length(); + } + } + } + return total; + } + + @Benchmark + public int reggieMatchIntoBoundsOnly() { + int total = 0; + for (String input : INPUTS) { + if (reggieMatcher.matchInto(input, starts, ends)) { + for (int group = 1; group <= 11; group++) { + total += starts[group] + ends[group]; + } + } + } + return total; + } +} diff --git a/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAFallbackPatterns.java b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAFallbackPatterns.java index 4e2e916..cde37ab 100644 --- a/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAFallbackPatterns.java +++ b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAFallbackPatterns.java @@ -48,7 +48,11 @@ public abstract class NFAFallbackPatterns implements ReggiePatterns { @RegexPattern("(\\d{3})-(\\d+)-(\\d{4})") public abstract ReggieMatcher phoneWithVariableLength(); - @RegexPattern("(<\\w+>).*?()") + // Original: (<\w+>).*?() — lazy .*? falls back to java.util.regex because + // RECURSIVE_DESCENT lacks general alternation backtracking (see FallbackPatternDetector). + // Using greedy .* here; .* overlaps with '<', so the concat triggers backtracking via + // requiresBacktrackingForGroups and still routes through RECURSIVE_DESCENT. + @RegexPattern("(<\\w+>).*()") public abstract ReggieMatcher xmlTags(); // ==================== diff --git a/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAMatchIntoBenchmark.java b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAMatchIntoBenchmark.java new file mode 100644 index 0000000..244ce61 --- /dev/null +++ b/reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAMatchIntoBenchmark.java @@ -0,0 +1,136 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.benchmark; + +import com.datadoghq.reggie.Reggie; +import com.datadoghq.reggie.runtime.MatchResult; +import com.datadoghq.reggie.runtime.ReggieMatcher; +import java.util.concurrent.TimeUnit; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import org.openjdk.jmh.annotations.Benchmark; +import org.openjdk.jmh.annotations.BenchmarkMode; +import org.openjdk.jmh.annotations.Fork; +import org.openjdk.jmh.annotations.Measurement; +import org.openjdk.jmh.annotations.Mode; +import org.openjdk.jmh.annotations.OutputTimeUnit; +import org.openjdk.jmh.annotations.Scope; +import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; +import org.openjdk.jmh.annotations.Warmup; + +/** Benchmark for generated NFA capture extraction via matchInto(). */ +@BenchmarkMode(Mode.AverageTime) +@OutputTimeUnit(TimeUnit.NANOSECONDS) +@State(Scope.Thread) +@Warmup(iterations = 3, time = 1) +@Measurement(iterations = 5, time = 1) +@Fork(1) +public class NFAMatchIntoBenchmark { + private static final String NFA_PATTERN = "(?=.*[0-9])([a-z]+)([0-9]+)"; + private static final String[] INPUTS = {"abc123", "xyz98765", "letters", "a1"}; + + private Pattern jdkPattern; + private ReggieMatcher reggieMatcher; + private int[] starts; + private int[] ends; + + @Setup + public void setup() throws ReflectiveOperationException { + jdkPattern = Pattern.compile(NFA_PATTERN); + reggieMatcher = Reggie.compile(NFA_PATTERN); + starts = new int[3]; + ends = new int[3]; + + Class declaringClass = + reggieMatcher + .getClass() + .getMethod("matchInto", String.class, int[].class, int[].class) + .getDeclaringClass(); + if (declaringClass == ReggieMatcher.class) { + throw new IllegalStateException( + "benchmark pattern does not use generated matchInto override"); + } + } + + @Benchmark + public int jdkParseBoundsOnly() { + int total = 0; + for (String input : INPUTS) { + Matcher matcher = jdkPattern.matcher(input); + if (matcher.matches()) { + for (int group = 0; group <= 2; group++) { + total += matcher.start(group) + matcher.end(group); + } + } + } + return total; + } + + @Benchmark + public int reggieMatchResultBoundsOnly() { + int total = 0; + for (String input : INPUTS) { + MatchResult match = reggieMatcher.match(input); + if (match != null) { + for (int group = 0; group <= 2; group++) { + total += match.start(group) + match.end(group); + } + } + } + return total; + } + + @Benchmark + public int reggieMatchIntoBoundsOnly() { + int total = 0; + for (String input : INPUTS) { + if (reggieMatcher.matchInto(input, starts, ends)) { + for (int group = 0; group <= 2; group++) { + total += starts[group] + ends[group]; + } + } + } + return total; + } + + @Benchmark + public int reggieMatchResultParseAndExtract() { + int total = 0; + for (String input : INPUTS) { + MatchResult match = reggieMatcher.match(input); + if (match != null) { + for (int group = 1; group <= 2; group++) { + total += match.group(group).length(); + } + } + } + return total; + } + + @Benchmark + public int reggieMatchIntoParseAndExtract() { + int total = 0; + for (String input : INPUTS) { + if (reggieMatcher.matchInto(input, starts, ends)) { + for (int group = 1; group <= 2; group++) { + total += input.substring(starts[group], ends[group]).length(); + } + } + } + return total; + } +} diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/CaptureProjection.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/CaptureProjection.java new file mode 100644 index 0000000..3950030 --- /dev/null +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/CaptureProjection.java @@ -0,0 +1,127 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.codegen.analysis; + +import com.datadoghq.reggie.codegen.ast.*; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +/** AST-level capture projection utilities. */ +public final class CaptureProjection { + private CaptureProjection() {} + + /** + * Rewrites unnamed captures that are not needed by regex semantics into non-capturing groups. + * Named groups keep their original group numbers so callers that discovered group indexes from + * the original regex (for example Grok) can keep using those indexes. + */ + public static RegexNode preserveNamedAndSemanticCaptures(RegexNode ast) { + Set semanticGroups = new HashSet<>(); + collectSemanticGroupReferences(ast, semanticGroups); + return rewrite(ast, semanticGroups); + } + + private static void collectSemanticGroupReferences(RegexNode node, Set semanticGroups) { + if (node instanceof BackreferenceNode) { + semanticGroups.add(((BackreferenceNode) node).groupNumber); + } else if (node instanceof ConditionalNode) { + ConditionalNode conditional = (ConditionalNode) node; + semanticGroups.add(conditional.condition); + collectSemanticGroupReferences(conditional.thenBranch, semanticGroups); + if (conditional.elseBranch != null) { + collectSemanticGroupReferences(conditional.elseBranch, semanticGroups); + } + } else if (node instanceof SubroutineNode) { + SubroutineNode subroutine = (SubroutineNode) node; + if (subroutine.groupNumber > 0) { + semanticGroups.add(subroutine.groupNumber); + } + } else if (node instanceof GroupNode) { + collectSemanticGroupReferences(((GroupNode) node).child, semanticGroups); + } else if (node instanceof QuantifierNode) { + collectSemanticGroupReferences(((QuantifierNode) node).child, semanticGroups); + } else if (node instanceof ConcatNode) { + for (RegexNode child : ((ConcatNode) node).children) { + collectSemanticGroupReferences(child, semanticGroups); + } + } else if (node instanceof AlternationNode) { + for (RegexNode alternative : ((AlternationNode) node).alternatives) { + collectSemanticGroupReferences(alternative, semanticGroups); + } + } else if (node instanceof AssertionNode) { + collectSemanticGroupReferences(((AssertionNode) node).subPattern, semanticGroups); + } else if (node instanceof BranchResetNode) { + for (RegexNode alternative : ((BranchResetNode) node).alternatives) { + collectSemanticGroupReferences(alternative, semanticGroups); + } + } + } + + private static RegexNode rewrite(RegexNode node, Set semanticGroups) { + if (node instanceof GroupNode) { + GroupNode group = (GroupNode) node; + RegexNode child = rewrite(group.child, semanticGroups); + boolean keepCapturing = + group.capturing && (group.name != null || semanticGroups.contains(group.groupNumber)); + return new GroupNode(child, keepCapturing ? group.groupNumber : 0, keepCapturing, group.name); + } + if (node instanceof QuantifierNode) { + QuantifierNode quantifier = (QuantifierNode) node; + return new QuantifierNode( + rewrite(quantifier.child, semanticGroups), + quantifier.min, + quantifier.max, + quantifier.greedy); + } + if (node instanceof ConcatNode) { + List children = new ArrayList<>(); + for (RegexNode child : ((ConcatNode) node).children) { + children.add(rewrite(child, semanticGroups)); + } + return new ConcatNode(children); + } + if (node instanceof AlternationNode) { + List alternatives = new ArrayList<>(); + for (RegexNode alternative : ((AlternationNode) node).alternatives) { + alternatives.add(rewrite(alternative, semanticGroups)); + } + return new AlternationNode(alternatives); + } + if (node instanceof AssertionNode) { + AssertionNode assertion = (AssertionNode) node; + return new AssertionNode( + assertion.type, rewrite(assertion.subPattern, semanticGroups), assertion.fixedWidth); + } + if (node instanceof ConditionalNode) { + ConditionalNode conditional = (ConditionalNode) node; + return new ConditionalNode( + conditional.condition, + rewrite(conditional.thenBranch, semanticGroups), + conditional.elseBranch != null ? rewrite(conditional.elseBranch, semanticGroups) : null); + } + if (node instanceof BranchResetNode) { + BranchResetNode branchReset = (BranchResetNode) node; + List alternatives = new ArrayList<>(); + for (RegexNode alternative : branchReset.alternatives) { + alternatives.add(rewrite(alternative, semanticGroups)); + } + return new BranchResetNode(alternatives, branchReset.maxGroupNumber); + } + return node; + } +} diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetector.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetector.java index 66ed5b9..09bf355 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetector.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetector.java @@ -29,6 +29,9 @@ import com.datadoghq.reggie.codegen.ast.RegexNode; import com.datadoghq.reggie.codegen.ast.RegexVisitor; import com.datadoghq.reggie.codegen.ast.SubroutineNode; +import java.util.HashSet; +import java.util.List; +import java.util.Set; /** * Detects regex patterns that trigger known correctness bugs in the reggie engine. When a bug is @@ -55,13 +58,284 @@ public static String needsFallback(RegexNode ast, PatternAnalyzer.MatchingStrate return "lookahead inside quantified group"; } - // Bug 5: lookbehind and lookahead used together (sandwich / interaction) - if (v.hasLookbehind && v.hasLookahead) { - return "lookbehind and lookahead combined"; + // Anchor inside a quantifier (e.g. ${2}, \z{n}) creates unusual NFA/DFA shapes that the + // current generators don't handle correctly. + if (v.hasAnchorInQuantifier) { + return "anchor inside quantifier: ${n}, \\z{n}, etc."; } + + // END-type anchor ($, \Z) immediately before a char consumer within a concat: Reggie's DFA + // prunes consuming transitions from END-conditioned states, missing the valid "$ then consume + // final \\n" path that Java regex allows. Route to JDK for correct semantics. + if (hasEndAnchorBeforeConsumer(ast)) { + return "end-anchor before consumer: $ or \\Z followed by char-consuming element"; + } + + // RECURSIVE_DESCENT uses a greedy-first descent parser with limited backtracking (quantifiers + // followed by fixed suffixes). It does NOT implement general alternation backtracking: when an + // alternation's first branch partially matches but the following context fails, the parser + // cannot retry a different branch. Lazy quantifiers expose this because they interact heavily + // with alternation (e.g. a|ab matches "ab" requires the engine to try both branches). Until + // the generator is extended with full continuation-passing backtracking, lazy patterns route + // to java.util.regex which handles them correctly. + // + // OPTIMIZED_NFA_WITH_BACKREFS findMatchFromMethod always returns the LONGEST match (it tries + // all end positions and keeps the maximum). Lazy quantifiers require the SHORTEST match. + // Without proper lazy-aware result selection, these patterns produce wrong spans. + if ((strategy == PatternAnalyzer.MatchingStrategy.RECURSIVE_DESCENT + || strategy == PatternAnalyzer.MatchingStrategy.OPTIMIZED_NFA_WITH_BACKREFS) + && v.hasLazyQuantifier) { + return "lazy quantifier: requires shortest-match semantics not supported by this strategy"; + } + + // Thompson NFA group-state contamination (OPTIMIZED_NFA_WITH_BACKREFS) and RECURSIVE_DESCENT + // backtracking limitations: both fail when a backref \N appears in one alternative of an + // alternation but group N is defined in a DIFFERENT alternative of the same alternation. + if ((strategy == PatternAnalyzer.MatchingStrategy.OPTIMIZED_NFA_WITH_BACKREFS + || strategy == PatternAnalyzer.MatchingStrategy.RECURSIVE_DESCENT) + && hasCrossAlternativeBackref(ast)) { + return "cross-alternative backref: group captured in one branch, used in another"; + } + + // Parallel NFA simulation uses shared group arrays across all active paths. When a backref + // \N references a group that can capture the empty string (nullable), the greedy path may + // record a non-zero groupLen while the empty-capture path needs groupLen=0. The shared + // array records the wrong value, causing the backref check to fail or spuriously succeed. + if (strategy == PatternAnalyzer.MatchingStrategy.OPTIMIZED_NFA_WITH_BACKREFS + && hasNullableBackrefGroup(ast)) { + return "backref to nullable group: parallel NFA simulation records wrong capture span"; + } + + // OptionalGroupBackref generator has a bug in the "group did not participate" path: + // it treats \N as vacuously satisfied (matching empty) when the optional group was skipped. + // Java semantics: \N to a non-participating group FAILS. + // The bug only manifests for (X)? forms where X is non-nullable (the group might not + // participate at all). The (X|) empty-alt form always captures something, so no bug there. + if (strategy == PatternAnalyzer.MatchingStrategy.OPTIONAL_GROUP_BACKREF + && hasNonNullableQuantifiedOptionalGroupWithBackref(ast)) { + return "optional group backref with non-nullable (X)? form: unmatched group wrongly " + + "treated as empty"; + } + return null; } + /** + * Returns true if any AlternationNode in {@code ast} has a backref \N in one alternative where + * group N's capturing paren is in a DIFFERENT alternative of that same alternation. + */ + private static boolean hasCrossAlternativeBackref(RegexNode ast) { + if (ast instanceof AlternationNode) { + AlternationNode alt = (AlternationNode) ast; + List alts = alt.alternatives; + @SuppressWarnings("unchecked") + Set[] groups = new Set[alts.size()]; + @SuppressWarnings("unchecked") + Set[] backrefs = new Set[alts.size()]; + Set allGroupsInAlt = new HashSet<>(); + for (int i = 0; i < alts.size(); i++) { + groups[i] = new HashSet<>(); + backrefs[i] = new HashSet<>(); + collectGroupsInSubtree(alts.get(i), groups[i]); + collectBackrefsInSubtree(alts.get(i), backrefs[i]); + allGroupsInAlt.addAll(groups[i]); + } + for (int i = 0; i < alts.size(); i++) { + for (int groupNum : backrefs[i]) { + if (!groups[i].contains(groupNum) && allGroupsInAlt.contains(groupNum)) { + return true; + } + } + } + for (RegexNode alternative : alts) { + if (hasCrossAlternativeBackref(alternative)) return true; + } + return false; + } + if (ast instanceof ConcatNode) { + for (RegexNode child : ((ConcatNode) ast).children) { + if (hasCrossAlternativeBackref(child)) return true; + } + } + if (ast instanceof GroupNode) { + return hasCrossAlternativeBackref(((GroupNode) ast).child); + } + if (ast instanceof QuantifierNode) { + return hasCrossAlternativeBackref(((QuantifierNode) ast).child); + } + return false; + } + + private static void collectGroupsInSubtree(RegexNode node, Set groups) { + if (node instanceof GroupNode) { + GroupNode g = (GroupNode) node; + if (g.capturing) groups.add(g.groupNumber); + collectGroupsInSubtree(g.child, groups); + } else if (node instanceof ConcatNode) { + for (RegexNode c : ((ConcatNode) node).children) collectGroupsInSubtree(c, groups); + } else if (node instanceof AlternationNode) { + for (RegexNode a : ((AlternationNode) node).alternatives) collectGroupsInSubtree(a, groups); + } else if (node instanceof QuantifierNode) { + collectGroupsInSubtree(((QuantifierNode) node).child, groups); + } + } + + private static void collectBackrefsInSubtree(RegexNode node, Set backrefs) { + if (node instanceof BackreferenceNode) { + backrefs.add(((BackreferenceNode) node).groupNumber); + } else if (node instanceof ConcatNode) { + for (RegexNode c : ((ConcatNode) node).children) collectBackrefsInSubtree(c, backrefs); + } else if (node instanceof AlternationNode) { + for (RegexNode a : ((AlternationNode) node).alternatives) + collectBackrefsInSubtree(a, backrefs); + } else if (node instanceof GroupNode) { + collectBackrefsInSubtree(((GroupNode) node).child, backrefs); + } else if (node instanceof QuantifierNode) { + collectBackrefsInSubtree(((QuantifierNode) node).child, backrefs); + } + } + + /** + * Returns true if the pattern has a (X)? optional group (not an (X|) empty-alt group) AND a + * backref to that group. The (X)? form can result in the group not participating at all; when X + * is non-nullable, backref \N to the non-participating group should fail per Java semantics, but + * OptionalGroupBackrefBytecodeGenerator incorrectly treats it as empty. + */ + private static boolean hasNonNullableQuantifiedOptionalGroupWithBackref(RegexNode ast) { + Set backrefs = new HashSet<>(); + collectBackrefsInSubtree(ast, backrefs); + if (backrefs.isEmpty()) return false; + return hasQuantifiedOptionalGroupForBackref(ast, backrefs); + } + + /** + * Walk the AST looking for QuantifierNode(min=0,max=1,child=GroupNode(N,...)) where N is in the + * backref set and the group content is non-nullable. + */ + private static boolean hasQuantifiedOptionalGroupForBackref( + RegexNode node, Set backrefs) { + if (node instanceof QuantifierNode) { + QuantifierNode q = (QuantifierNode) node; + if (q.min == 0 && q.max == 1 && q.child instanceof GroupNode) { + GroupNode g = (GroupNode) q.child; + if (g.capturing && backrefs.contains(g.groupNumber) && !subtreeIsNullable(g.child)) { + return true; + } + } + return hasQuantifiedOptionalGroupForBackref(q.child, backrefs); + } + if (node instanceof ConcatNode) { + for (RegexNode c : ((ConcatNode) node).children) + if (hasQuantifiedOptionalGroupForBackref(c, backrefs)) return true; + } + if (node instanceof GroupNode) + return hasQuantifiedOptionalGroupForBackref(((GroupNode) node).child, backrefs); + if (node instanceof AlternationNode) { + for (RegexNode a : ((AlternationNode) node).alternatives) + if (hasQuantifiedOptionalGroupForBackref(a, backrefs)) return true; + } + return false; + } + + /** + * Returns true if any backref \N references a group whose content is nullable (can match the + * empty string). In parallel NFA simulation, when such a group exists, the shared group-capture + * arrays may be overwritten by the greedy (non-empty) path before the empty-capture path's + * backref check runs, causing the check to use the wrong capture span. + */ + private static boolean hasNullableBackrefGroup(RegexNode ast) { + Set backrefNums = new HashSet<>(); + collectBackrefsInSubtree(ast, backrefNums); + if (backrefNums.isEmpty()) return false; + for (int groupNum : backrefNums) { + if (isGroupNullable(ast, groupNum)) return true; + } + return false; + } + + /** Walk the AST to find the capturing group with the given number and test nullability. */ + private static boolean isGroupNullable(RegexNode node, int groupNum) { + if (node instanceof GroupNode) { + GroupNode g = (GroupNode) node; + if (g.capturing && g.groupNumber == groupNum) { + return subtreeIsNullable(g.child); + } + return isGroupNullable(g.child, groupNum); + } + if (node instanceof ConcatNode) { + for (RegexNode c : ((ConcatNode) node).children) { + if (isGroupNullable(c, groupNum)) return true; + } + return false; + } + if (node instanceof AlternationNode) { + for (RegexNode a : ((AlternationNode) node).alternatives) { + if (isGroupNullable(a, groupNum)) return true; + } + return false; + } + if (node instanceof QuantifierNode) { + return isGroupNullable(((QuantifierNode) node).child, groupNum); + } + return false; + } + + /** Returns true if the subtree can match the empty string (zero characters). */ + private static boolean subtreeIsNullable(RegexNode node) { + if (node instanceof QuantifierNode) { + return ((QuantifierNode) node).min == 0 || subtreeIsNullable(((QuantifierNode) node).child); + } + if (node instanceof ConcatNode) { + for (RegexNode c : ((ConcatNode) node).children) { + if (!subtreeIsNullable(c)) return false; + } + return true; + } + if (node instanceof AlternationNode) { + for (RegexNode a : ((AlternationNode) node).alternatives) { + if (subtreeIsNullable(a)) return true; + } + return false; + } + if (node instanceof GroupNode) { + return subtreeIsNullable(((GroupNode) node).child); + } + // AnchorNode is zero-width (nullable); LiteralNode and CharClassNode are not. + return node instanceof AnchorNode; + } + + /** + * Returns true if the AST contains an END-type anchor ($, \Z, \z) immediately before a + * char-consuming element within the same ConcatNode. Such patterns rely on `$` matching "before + * final newline" and then the subsequent element consuming that newline — a semantic that + * Reggie's DFA cannot express because it prunes consuming transitions from END-conditioned states + * without tracking whether the end-char is a newline at runtime. + */ + private static boolean hasEndAnchorBeforeConsumer(RegexNode ast) { + if (ast instanceof ConcatNode) { + ConcatNode concat = (ConcatNode) ast; + for (int i = 0; i < concat.children.size() - 1; i++) { + RegexNode child = concat.children.get(i); + if (child instanceof AnchorNode) { + AnchorNode anchor = (AnchorNode) child; + if (anchor.type == AnchorNode.Type.END || anchor.type == AnchorNode.Type.STRING_END) { + // Next sibling is a char consumer — this pattern needs JDK + return true; + } + } + } + for (RegexNode c : concat.children) if (hasEndAnchorBeforeConsumer(c)) return true; + } + if (ast instanceof GroupNode) return hasEndAnchorBeforeConsumer(((GroupNode) ast).child); + if (ast instanceof QuantifierNode) + return hasEndAnchorBeforeConsumer(((QuantifierNode) ast).child); + if (ast instanceof AlternationNode) { + for (RegexNode a : ((AlternationNode) ast).alternatives) + if (hasEndAnchorBeforeConsumer(a)) return true; + } + return false; + } + private static boolean isLookahead(AssertionNode.Type t) { return t == AssertionNode.Type.POSITIVE_LOOKAHEAD || t == AssertionNode.Type.NEGATIVE_LOOKAHEAD; } @@ -71,6 +345,22 @@ private static boolean isLookbehind(AssertionNode.Type t) { || t == AssertionNode.Type.NEGATIVE_LOOKBEHIND; } + private static boolean containsAnchor(RegexNode node) { + if (node instanceof AnchorNode) return true; + if (node instanceof GroupNode) return containsAnchor(((GroupNode) node).child); + if (node instanceof ConcatNode) { + for (RegexNode c : ((ConcatNode) node).children) { + if (containsAnchor(c)) return true; + } + } + if (node instanceof AlternationNode) { + for (RegexNode alt : ((AlternationNode) node).alternatives) { + if (containsAnchor(alt)) return true; + } + } + return false; + } + /** Returns true if {@code node} is or recursively contains a lookahead AssertionNode. */ private static boolean containsLookahead(RegexNode node) { if (node instanceof AssertionNode) { @@ -96,6 +386,8 @@ private static final class Visitor implements RegexVisitor { boolean lookaheadInQuantifier = false; boolean hasLookahead = false; boolean hasLookbehind = false; + boolean hasLazyQuantifier = false; + boolean hasAnchorInQuantifier = false; @Override public Void visitAssertion(AssertionNode node) { @@ -114,6 +406,15 @@ public Void visitQuantifier(QuantifierNode node) { if (containsLookahead(node.child)) { lookaheadInQuantifier = true; } + if (!node.greedy) { + hasLazyQuantifier = true; + } + if (containsAnchor(node.child) && (node.min != 1 || node.max != 1)) { + // Anchor inside a quantifier (other than {1}): the quantifier tries to repeat a + // zero-width assertion. This creates unusual NFA/DFA shapes that the current + // generators don't handle correctly (e.g. ${2}, ${0,3}, \z{2}). + hasAnchorInQuantifier = true; + } node.child.accept(this); return null; } diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearPatternAnalyzer.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearPatternAnalyzer.java index e84b944..447d19b 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearPatternAnalyzer.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearPatternAnalyzer.java @@ -66,7 +66,9 @@ public static LinearPatternInfo analyze(RegexNode ast, int groupCount) { @Override public Void visitLiteral(LiteralNode node) { - // Match a literal character (convert to string) + if (node.ch == 0) { + return null; // Epsilon — empty-group placeholder (char 0), consume nothing + } operations.add( new LinearPatternInfo.LinearOperation( LinearPatternInfo.LinearOperation.Type.MATCH_LITERAL, String.valueOf(node.ch))); diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlan.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlan.java new file mode 100644 index 0000000..a9961f4 --- /dev/null +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlan.java @@ -0,0 +1,186 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.codegen.analysis; + +import java.util.ArrayList; +import java.util.List; +import java.util.Optional; + +/** Executable, deterministic plan for a categorized linear-token-sequence regex. */ +public record LinearTokenSequencePlan(List ops, int groupCount) { + + public enum OpKind { + LITERAL, + WHITESPACE_PLUS, + CAPTURE_NON_SPACE, + CAPTURE_DIGITS, + CAPTURE_SIGNED_INTEGER, + CAPTURE_DECIMAL_NUMBER, + CAPTURE_SIGNED_DECIMAL_NUMBER, + CAPTURE_WORD, + CAPTURE_UNTIL_DELIMITER, + CAPTURE_QUOTED_UNTIL_DELIMITER, + CAPTURE_QUOTED_NON_SPACE, + CAPTURE_IP_OR_HOST, + CAPTURE_SIGNED_INTEGER_OR_DASH, + CAPTURE_SIGNED_INTEGER_OR_UNCAPTURED_DASH, + CAPTURE_BRACKETED_WORD_AFTER_SKIP, + SKIP_ANY, + ANCHOR, + OPTIONAL_SEQUENCE + } + + public record Op( + OpKind kind, int groupNumber, String literal, char delimiter, List children) { + public Op { + children = children == null ? List.of() : List.copyOf(children); + } + + static Op literal(String literal) { + return new Op(OpKind.LITERAL, 0, literal, (char) 0, List.of()); + } + + static Op capture(OpKind kind, int groupNumber) { + return new Op(kind, groupNumber, null, (char) 0, List.of()); + } + + static Op captureUntil(OpKind kind, int groupNumber, char delimiter) { + return new Op(kind, groupNumber, null, delimiter, List.of()); + } + + static Op uncaptured(OpKind kind) { + return new Op(kind, 0, null, (char) 0, List.of()); + } + + static Op optional(List children) { + return new Op(OpKind.OPTIONAL_SEQUENCE, 0, null, (char) 0, children); + } + } + + public LinearTokenSequencePlan { + ops = List.copyOf(ops); + } + + /** Converts categorizer atoms into a closed, executable linear-token-sequence plan. */ + public static Optional from(PatternCategorization categorization) { + if (!categorization.isLinearTokenSequence()) return Optional.empty(); + + List ops = new ArrayList<>(); + List atoms = categorization.atoms(); + int maxGroup = 0; + + for (int i = 0; i < atoms.size(); i++) { + PatternAtom atom = atoms.get(i); + maxGroup = Math.max(maxGroup, atom.groupNumber()); + + if (isQuotedCapture(atoms, i)) { + trimTrailingQuote(ops); + ops.add(quotedCaptureOp(atom)); + PatternAtom next = atoms.get(++i); + String remainder = next.literal().substring(1); + if (!remainder.isEmpty()) addLiteral(ops, remainder); + continue; + } + + Op op = opFor(atom); + if (op == null) return Optional.empty(); + if (op.kind == OpKind.LITERAL) { + addLiteral(ops, op.literal); + } else { + ops.add(op); + } + } + + return Optional.of(new LinearTokenSequencePlan(ops, maxGroup)); + } + + private static Op opFor(PatternAtom atom) { + return switch (atom.kind()) { + case LITERAL -> Op.literal(atom.literal()); + case WHITESPACE_PLUS -> Op.uncaptured(OpKind.WHITESPACE_PLUS); + case NON_SPACE_PLUS -> Op.capture(OpKind.CAPTURE_NON_SPACE, atom.groupNumber()); + case DIGITS_PLUS -> Op.capture(OpKind.CAPTURE_DIGITS, atom.groupNumber()); + case SIGNED_INTEGER -> Op.capture(OpKind.CAPTURE_SIGNED_INTEGER, atom.groupNumber()); + case SIGNED_INTEGER_OR_DASH -> + Op.capture(OpKind.CAPTURE_SIGNED_INTEGER_OR_DASH, atom.groupNumber()); + case SIGNED_INTEGER_OR_UNCAPTURED_DASH -> + Op.capture(OpKind.CAPTURE_SIGNED_INTEGER_OR_UNCAPTURED_DASH, atom.groupNumber()); + case DECIMAL_NUMBER -> Op.capture(OpKind.CAPTURE_DECIMAL_NUMBER, atom.groupNumber()); + case SIGNED_DECIMAL_NUMBER -> + Op.capture(OpKind.CAPTURE_SIGNED_DECIMAL_NUMBER, atom.groupNumber()); + case WORD -> Op.capture(OpKind.CAPTURE_WORD, atom.groupNumber()); + case IP_OR_HOST -> Op.capture(OpKind.CAPTURE_IP_OR_HOST, atom.groupNumber()); + case BRACKETED_WORD_AFTER_SKIP -> + Op.capture(OpKind.CAPTURE_BRACKETED_WORD_AFTER_SKIP, atom.groupNumber()); + case UNTIL_DELIMITER -> + Op.captureUntil(OpKind.CAPTURE_UNTIL_DELIMITER, atom.groupNumber(), atom.delimiter()); + case QUOTED_UNTIL_DELIMITER -> + Op.captureUntil( + OpKind.CAPTURE_QUOTED_UNTIL_DELIMITER, atom.groupNumber(), atom.delimiter()); + case ANY_STAR -> Op.uncaptured(OpKind.SKIP_ANY); + case ANCHOR -> Op.uncaptured(OpKind.ANCHOR); + case OPTIONAL_SEQUENCE -> optionalOpFor(atom); + case COMPLEX_ALTERNATION -> null; + }; + } + + private static Op quotedCaptureOp(PatternAtom atom) { + if (atom.kind() == PatternAtom.Kind.NON_SPACE_PLUS) { + return Op.captureUntil(OpKind.CAPTURE_QUOTED_NON_SPACE, atom.groupNumber(), '"'); + } + return Op.captureUntil( + OpKind.CAPTURE_QUOTED_UNTIL_DELIMITER, atom.groupNumber(), atom.delimiter()); + } + + private static Op optionalOpFor(PatternAtom atom) { + PatternCategorization nested = + new PatternCategorization( + PatternCategorization.Category.LINEAR_TOKEN_SEQUENCE, atom.children(), List.of()); + return LinearTokenSequencePlan.from(nested).map(plan -> Op.optional(plan.ops())).orElse(null); + } + + private static boolean isQuotedCapture(List atoms, int index) { + PatternAtom atom = atoms.get(index); + if (!((atom.kind() == PatternAtom.Kind.UNTIL_DELIMITER && atom.delimiter() == '"') + || atom.kind() == PatternAtom.Kind.NON_SPACE_PLUS)) return false; + return index > 0 + && index + 1 < atoms.size() + && atoms.get(index - 1).kind() == PatternAtom.Kind.LITERAL + && atoms.get(index - 1).literal().endsWith("\"") + && atoms.get(index + 1).kind() == PatternAtom.Kind.LITERAL + && atoms.get(index + 1).literal().startsWith("\""); + } + + private static void trimTrailingQuote(List ops) { + if (ops.isEmpty()) throw new IllegalStateException("missing literal before quoted capture"); + Op previous = ops.remove(ops.size() - 1); + if (previous.kind() != OpKind.LITERAL || !previous.literal().endsWith("\"")) { + throw new IllegalStateException("missing quote literal before quoted capture"); + } + String trimmed = previous.literal().substring(0, previous.literal().length() - 1); + if (!trimmed.isEmpty()) addLiteral(ops, trimmed); + } + + private static void addLiteral(List ops, String literal) { + if (literal.isEmpty()) return; + if (!ops.isEmpty() && ops.get(ops.size() - 1).kind() == OpKind.LITERAL) { + Op previous = ops.remove(ops.size() - 1); + ops.add(Op.literal(previous.literal() + literal)); + } else { + ops.add(Op.literal(literal)); + } + } +} diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAnalyzer.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAnalyzer.java index a2a3780..efa5753 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAnalyzer.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAnalyzer.java @@ -18,6 +18,7 @@ import com.datadoghq.reggie.codegen.ast.*; import com.datadoghq.reggie.codegen.automaton.CharSet; import com.datadoghq.reggie.codegen.automaton.DFA; +import com.datadoghq.reggie.codegen.automaton.DFATableData; import com.datadoghq.reggie.codegen.automaton.NFA; import com.datadoghq.reggie.codegen.automaton.StateExplosionException; import com.datadoghq.reggie.codegen.automaton.SubsetConstructor; @@ -32,6 +33,10 @@ /** Analyzes patterns and recommends bytecode generation strategy. */ public class PatternAnalyzer { + private static final int DFA_UNROLLED_STATE_LIMIT = 20; + private static final int DFA_SWITCH_STATE_LIMIT = 300; + private static final int DFA_TABLE_ESTIMATED_BYTES_LIMIT = 1 << 20; + private final RegexNode ast; private final NFA nfa; @@ -720,9 +725,38 @@ public MatchingStrategyResult analyzeAndRecommend(boolean ignoreGroupCount) { // Build DFA with tag computation enabled for Tagged DFA DFA dfa = constructor.buildDFA(nfa, true); + if (dfa.isAnchorConditionDiluted()) { + MatchingStrategyResult r = + new MatchingStrategyResult( + MatchingStrategy.OPTIMIZED_NFA, + null, + null, + false, + requiredLiterals, + null, + needsPosixSemantics); + r.anchorConditionDiluted = true; + return r; + } + + if ((containsAlternation(ast) || containsOptionalQuantifier(ast)) + && dfaHasAcceptingStateWithTransitions(dfa)) { + MatchingStrategyResult r = + new MatchingStrategyResult( + MatchingStrategy.OPTIMIZED_NFA, + null, + null, + false, + requiredLiterals, + null, + needsPosixSemantics); + r.alternationPriorityConflict = true; + return r; + } + // DFA with groups: choose strategy based on state count int stateCount = dfa.getStateCount(); - if (stateCount < 20) { + if (stateCount < DFA_UNROLLED_STATE_LIMIT) { return new MatchingStrategyResult( MatchingStrategy.DFA_UNROLLED_WITH_GROUPS, dfa, @@ -731,8 +765,8 @@ public MatchingStrategyResult analyzeAndRecommend(boolean ignoreGroupCount) { requiredLiterals, null, needsPosixSemantics); - } else if (stateCount < 300) { - // Use switch-based DFA for 20-300 states (better cache behavior) + } else if (stateCount < DFA_SWITCH_STATE_LIMIT) { + // Use switch-based DFA for medium state counts (better cache behavior) return new MatchingStrategyResult( MatchingStrategy.DFA_SWITCH_WITH_GROUPS, dfa, @@ -771,18 +805,45 @@ public MatchingStrategyResult analyzeAndRecommend(boolean ignoreGroupCount) { SubsetConstructor constructor = new SubsetConstructor(); DFA dfa = constructor.buildDFA(nfa); + if (dfa.isAnchorConditionDiluted()) { + MatchingStrategyResult r = + new MatchingStrategyResult( + MatchingStrategy.OPTIMIZED_NFA, null, null, false, requiredLiterals); + r.anchorConditionDiluted = true; + return r; + } + + // Alternation priority: Java NFA semantics pick the first alternative that matches at a + // position; DFA semantics pick the longest match. When the pattern has explicit alternation + // AND the DFA has an unconditionally-accepting state that has further outgoing transitions, + // the DFA will return a longer match than the NFA would for the same position — violating + // Java regex semantics. Flag for caller to route to JDK fallback. + if (containsAlternation(ast) && dfaHasAcceptingStateWithTransitions(dfa)) { + MatchingStrategyResult r = + new MatchingStrategyResult( + MatchingStrategy.OPTIMIZED_NFA, null, null, false, requiredLiterals); + r.alternationPriorityConflict = true; + return r; + } + // Choose DFA strategy based on state count int stateCount = dfa.getStateCount(); - if (stateCount < 20) { + if (stateCount < DFA_UNROLLED_STATE_LIMIT) { return new MatchingStrategyResult( MatchingStrategy.DFA_UNROLLED, dfa, null, false, requiredLiterals); - } else if (stateCount < 300) { - // Use switch-based DFA for 20-300 states (better cache behavior) + } else if (stateCount < DFA_SWITCH_STATE_LIMIT) { + // Use switch-based DFA for medium state counts (better cache behavior) return new MatchingStrategyResult( MatchingStrategy.DFA_SWITCH, dfa, null, false, requiredLiterals); - } else { + } else if (isDFATableEligible(dfa)) { return new MatchingStrategyResult( MatchingStrategy.DFA_TABLE, dfa, null, false, requiredLiterals); + } else { + // Large DFA state spaces can still be too expensive as tables. Fall back to NFA simulation + // when the compressed table would exceed the configured memory budget or the DFA uses + // features not yet supported by the table backend. + return new MatchingStrategyResult( + MatchingStrategy.OPTIMIZED_NFA, null, null, false, requiredLiterals); } } catch (StateExplosionException e) { // DFA too large, use optimized NFA @@ -791,11 +852,133 @@ public MatchingStrategyResult analyzeAndRecommend(boolean ignoreGroupCount) { } } + private boolean isDFATableEligible(DFA dfa) { + if (nfa != null + && (nfa.getGroupCount() > 0 + || nfa.hasStartAnchor() + || nfa.hasEndAnchor() + || nfa.hasStringStartAnchor() + || nfa.hasStringEndAnchor() + || nfa.hasStringEndAbsoluteAnchor() + || nfa.hasMultilineStartAnchor() + || nfa.hasMultilineEndAnchor())) { + return false; + } + + for (DFA.DFAState state : dfa.getAllStates()) { + if (!state.assertionChecks.isEmpty() + || !state.groupActions.isEmpty() + || !state.acceptanceAnchorConditions.isEmpty()) { + return false; + } + for (DFA.DFATransition transition : state.transitions.values()) { + if (!transition.tagOps.isEmpty() || !transition.entryGuard.isEmpty()) { + return false; + } + } + } + + return DFATableData.from(dfa).estimatedBytes() <= DFA_TABLE_ESTIMATED_BYTES_LIMIT; + } + private boolean hasBackreferences(RegexNode node) { BackrefDetector detector = new BackrefDetector(); return node.accept(detector); } + /** + * Returns true if the AST contains an optional quantifier (min=0) INSIDE a capturing group that + * is itself in a repeating quantifier (outer + or * or {n,m} with nPatterns like (a*b*c*d*e*) are NOT flagged because the group is not inside a repeating + * quantifier — there is no loop that could cause the DFA to accumulate extra chars. + */ + private boolean containsOptionalQuantifier(RegexNode node) { + return hasOptionalInsideRepeatingGroup(node); + } + + private boolean hasOptionalInsideRepeatingGroup(RegexNode node) { + if (node instanceof QuantifierNode) { + QuantifierNode q = (QuantifierNode) node; + // Repeating group: outer quantifier with max>1 and the child is a group + if ((q.max == -1 || q.max > 1) && q.min >= 1 && q.child instanceof GroupNode) { + if (subtreeContainsOptional(q.child)) return true; + } + return hasOptionalInsideRepeatingGroup(q.child); + } + if (node instanceof ConcatNode) { + for (RegexNode c : ((ConcatNode) node).children) + if (hasOptionalInsideRepeatingGroup(c)) return true; + return false; + } + if (node instanceof GroupNode) return hasOptionalInsideRepeatingGroup(((GroupNode) node).child); + if (node instanceof AlternationNode) { + for (RegexNode a : ((AlternationNode) node).alternatives) + if (hasOptionalInsideRepeatingGroup(a)) return true; + return false; + } + return false; + } + + /** Returns true if the subtree contains any QuantifierNode with min=0. */ + private static boolean subtreeContainsOptional(RegexNode node) { + if (node instanceof QuantifierNode) { + if (((QuantifierNode) node).min == 0) return true; + return subtreeContainsOptional(((QuantifierNode) node).child); + } + if (node instanceof ConcatNode) { + for (RegexNode c : ((ConcatNode) node).children) if (subtreeContainsOptional(c)) return true; + return false; + } + if (node instanceof GroupNode) return subtreeContainsOptional(((GroupNode) node).child); + if (node instanceof AlternationNode) { + for (RegexNode a : ((AlternationNode) node).alternatives) + if (subtreeContainsOptional(a)) return true; + return false; + } + return false; + } + + private boolean containsAlternation(RegexNode node) { + if (node instanceof AlternationNode) return true; + if (node instanceof ConcatNode) { + for (RegexNode child : ((ConcatNode) node).children) { + if (containsAlternation(child)) return true; + } + return false; + } + if (node instanceof GroupNode) return containsAlternation(((GroupNode) node).child); + if (node instanceof QuantifierNode) return containsAlternation(((QuantifierNode) node).child); + return false; + } + + private boolean dfaHasAcceptingStateWithTransitions(DFA dfa) { + for (DFA.DFAState state : dfa.getAllStates()) { + if (state.accepting && !state.transitions.isEmpty()) { + // Unconditional acceptance with outgoing transitions: DFA longest-match + // will advance past the zero-width match point. + if (state.acceptanceAnchorConditions.isEmpty()) { + return true; + } + // START-class anchor acceptance with outgoing transitions: at position 0 + // (where ^/\A fire) the DFA can still advance via transitions, diverging + // from NFA first-alternative semantics. END-class anchors are safe because + // transitions cannot fire at end-of-input. + for (NFA.AnchorType a : state.acceptanceAnchorConditions) { + if (a == NFA.AnchorType.START + || a == NFA.AnchorType.STRING_START + || a == NFA.AnchorType.START_MULTILINE) { + return true; + } + } + } + } + return false; + } + private boolean hasLookaround(RegexNode node) { LookaroundDetector detector = new LookaroundDetector(); return node.accept(detector); @@ -1049,7 +1232,8 @@ private CharSet getGreedyGroupCharSet(RegexNode node) { /** Get the CharSet that a node can match. Returns null if the CharSet cannot be determined. */ private CharSet getNodeCharSet(RegexNode node) { if (node instanceof CharClassNode) { - return ((CharClassNode) node).chars; + CharClassNode charClass = (CharClassNode) node; + return charClass.negated ? charClass.chars.complement() : charClass.chars; } if (node instanceof LiteralNode) { LiteralNode lit = (LiteralNode) node; @@ -1080,7 +1264,8 @@ private CharSet getNodeCharSet(RegexNode node) { */ private CharSet getFirstCharSet(RegexNode node) { if (node instanceof CharClassNode) { - return ((CharClassNode) node).chars; + CharClassNode charClass = (CharClassNode) node; + return charClass.negated ? charClass.chars.complement() : charClass.chars; } if (node instanceof LiteralNode) { LiteralNode lit = (LiteralNode) node; @@ -1773,6 +1958,24 @@ public static class MatchingStrategyResult { public final boolean usePosixLastMatch; // Use POSIX last-match semantics for groups in quantifiers + /** + * True when the DFA construction detected an anchor-condition dilution: alternation branches + * with differing positional-anchor requirements contributed to the same partition slice (or + * accept site), and the intersection logic collapsed those requirements to unconditional. The + * built DFA is structurally valid but semantically incorrect for such patterns; callers should + * route to a correct fallback engine (e.g. {@code JavaRegexFallbackMatcher}) rather than using + * the DFA. + */ + public boolean anchorConditionDiluted; + + /** + * True when the pattern has alternation and the DFA has an unconditionally-accepting state with + * further outgoing transitions. In this case the DFA uses longest-match semantics but Java NFA + * semantics require first-alternative preference. Callers should route to a correct fallback + * engine (e.g. {@code JavaRegexFallbackMatcher}) rather than using the DFA. + */ + public boolean alternationPriorityConflict; + public MatchingStrategyResult(MatchingStrategy strategy, DFA dfa) { this(strategy, dfa, null, false, java.util.Collections.emptySet(), null, false); } @@ -4991,6 +5194,10 @@ private GreedyCharClassInfo detectGreedyCharClass(RegexNode ast) { if (quant.max != -1 && quant.max > 1000) { return null; // Too large to optimize } + // {0} always produces an empty match; specialized generator does not handle max=0 correctly. + if (quant.max == 0) { + return null; + } if (!(quant.child instanceof CharClassNode)) { return null; @@ -5132,6 +5339,27 @@ private BoundedQuantifierInfo detectBoundedQuantifierSequence(RegexNode ast) { return null; } + // Reject when an optional element could greedily consume a character required by a later + // element. Example: c?c — the optional consumes the only 'c', leaving none for the required + // literal. The generator has no backtracking, so it would incorrectly return no-match. + for (int i = 0; i < elements.size() - 1; i++) { + if (!(elements.get(i) instanceof BoundedOptionalElement)) continue; + char optCh = ((BoundedOptionalElement) elements.get(i)).literal; + for (int j = i + 1; j < elements.size(); j++) { + BoundedElement next = elements.get(j); + if (next instanceof BoundedLiteralElement + && ((BoundedLiteralElement) next).literal == optCh) { + return null; + } + if (next instanceof BoundedQuantifierElement) { + BoundedQuantifierElement qe = (BoundedQuantifierElement) next; + if (qe.min >= 1 && (qe.charset.contains(optCh) != qe.negated)) { + return null; + } + } + } + } + return new BoundedQuantifierInfo(elements, minLen, maxLen, groupCounter[0]); } @@ -5823,8 +6051,15 @@ private QuantifierNode extractQuantifierFromPattern(RegexNode ast) { for (RegexNode child : concat.children) { if (child instanceof AnchorNode) { - // Skip anchors - continue; + // Skip START-type anchors — they are handled by requiresStartAnchor and do not + // affect SPECIALIZED_QUANTIFIED_GROUP semantics. + // END-type anchors ($, \Z, \z) make the pattern semantically impossible to match + // at non-end positions; the specialized generator is unaware of them, so bail out. + AnchorNode anchor = (AnchorNode) child; + if (anchor.type == AnchorNode.Type.START || anchor.type == AnchorNode.Type.STRING_START) { + continue; + } + return null; // END / STRING_END / STRING_END_ABSOLUTE / multiline anchors } else if (child instanceof QuantifierNode) { if (foundQuant != null) { // Multiple quantifiers - not a simple pattern @@ -5893,6 +6128,7 @@ private QuantifiedGroupInfo detectQuantifiedCapturingGroup(RegexNode ast) { String literal = null; boolean isAlternation = false; CharSet[] alternationCharSets = null; + boolean isNegatedCC = false; // negation flag for direct CharClassNode groups if (groupChild instanceof LiteralNode) { LiteralNode lit = (LiteralNode) groupChild; @@ -5905,8 +6141,10 @@ private QuantifiedGroupInfo detectQuantifiedCapturingGroup(RegexNode ast) { literal = String.valueOf(lit.ch); charSet = CharSet.of(lit.ch); } else if (groupChild instanceof CharClassNode) { - // ([a-z])+ - charSet = ((CharClassNode) groupChild).chars; + // ([a-z])+ or ([^b])+ + CharClassNode ccNode = (CharClassNode) groupChild; + charSet = ccNode.chars; + isNegatedCC = ccNode.negated; } else if (groupChild instanceof AlternationNode) { // (a|b)+ or (a+|b)+ AlternationNode alt = (AlternationNode) groupChild; @@ -5993,6 +6231,15 @@ private QuantifiedGroupInfo detectQuantifiedCapturingGroup(RegexNode ast) { return null; } + // When the OUTER quantifier is fixed (min==max, e.g. {4}) and the INNER quantifier is + // unbounded (like .+), the greedy inner loop consumes ALL remaining chars in the first + // outer iteration, leaving none for iterations 2..N (backtracking is needed). + // When the outer quantifier is unbounded (e.g. + or *), only one outer iteration fires + // on the available chars so the greedy inner loop is correct. + if ((innerQuant.max == -1 || innerQuant.max == Integer.MAX_VALUE) && quant.min == quant.max) { + return null; + } + // Extract charset from inner quantifier's child if (innerQuant.child instanceof LiteralNode) { charSet = CharSet.of(((LiteralNode) innerQuant.child).ch); @@ -6044,7 +6291,16 @@ private QuantifiedGroupInfo detectQuantifiedCapturingGroup(RegexNode ast) { charSet, literal, isAlternation, - alternationCharSets); + alternationCharSets, + false, // hasComplexAlternation + null, // alternationMinBounds + null, // alternationMaxBounds + null, // alternationNegated + null, // alternatives + false, // hasNestedQuantifier + 1, // innerMinQuantifier + 1, // innerMaxQuantifier + isNegatedCC); // isNegatedCharSet — propagates [^x] negation correctly } /** diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAtom.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAtom.java new file mode 100644 index 0000000..66c016e --- /dev/null +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAtom.java @@ -0,0 +1,83 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.codegen.analysis; + +import java.util.List; + +/** A semantic atom recognized by {@link PatternCategorizer}. */ +public record PatternAtom( + Kind kind, + int groupNumber, + String groupName, + String literal, + char delimiter, + List children) { + + public enum Kind { + LITERAL, + WHITESPACE_PLUS, + NON_SPACE_PLUS, + DIGITS_PLUS, + SIGNED_INTEGER, + SIGNED_INTEGER_OR_DASH, + SIGNED_INTEGER_OR_UNCAPTURED_DASH, + DECIMAL_NUMBER, + SIGNED_DECIMAL_NUMBER, + WORD, + IP_OR_HOST, + UNTIL_DELIMITER, + QUOTED_UNTIL_DELIMITER, + COMPLEX_ALTERNATION, + ANY_STAR, + ANCHOR, + OPTIONAL_SEQUENCE, + BRACKETED_WORD_AFTER_SKIP + } + + public PatternAtom { + children = children == null ? List.of() : List.copyOf(children); + } + + public static PatternAtom literal(String literal) { + return new PatternAtom(Kind.LITERAL, 0, null, literal, (char) 0, List.of()); + } + + public static PatternAtom uncaptured(Kind kind) { + return new PatternAtom(kind, 0, null, null, (char) 0, List.of()); + } + + public static PatternAtom captured(Kind kind, int groupNumber, String groupName) { + return new PatternAtom(kind, groupNumber, groupName, null, (char) 0, List.of()); + } + + public static PatternAtom capturedUntil(int groupNumber, String groupName, char delimiter) { + return new PatternAtom( + Kind.UNTIL_DELIMITER, groupNumber, groupName, null, delimiter, List.of()); + } + + public static PatternAtom capturedQuotedUntil(int groupNumber, String groupName, char delimiter) { + return new PatternAtom( + Kind.QUOTED_UNTIL_DELIMITER, groupNumber, groupName, null, delimiter, List.of()); + } + + public static PatternAtom optionalSequence(List children) { + return new PatternAtom(Kind.OPTIONAL_SEQUENCE, 0, null, null, (char) 0, children); + } + + public boolean isCaptured() { + return groupNumber > 0; + } +} diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternCategorization.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternCategorization.java new file mode 100644 index 0000000..1838802 --- /dev/null +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternCategorization.java @@ -0,0 +1,38 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.codegen.analysis; + +import java.util.List; + +/** Result produced by {@link PatternCategorizer}. */ +public record PatternCategorization( + Category category, List atoms, List notes) { + + public enum Category { + /** A deterministic sequence of reusable token atoms. */ + LINEAR_TOKEN_SEQUENCE, + + /** A pure literal sequence. */ + LITERAL_SEQUENCE, + + /** The pattern is valid but not yet represented by a reusable category. */ + GENERAL_REGEX + } + + public boolean isLinearTokenSequence() { + return category == Category.LINEAR_TOKEN_SEQUENCE || category == Category.LITERAL_SEQUENCE; + } +} diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternCategorizer.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternCategorizer.java new file mode 100644 index 0000000..c65b0c8 --- /dev/null +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternCategorizer.java @@ -0,0 +1,707 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.codegen.analysis; + +import com.datadoghq.reggie.codegen.ast.AlternationNode; +import com.datadoghq.reggie.codegen.ast.AnchorNode; +import com.datadoghq.reggie.codegen.ast.AssertionNode; +import com.datadoghq.reggie.codegen.ast.BackreferenceNode; +import com.datadoghq.reggie.codegen.ast.BranchResetNode; +import com.datadoghq.reggie.codegen.ast.CharClassNode; +import com.datadoghq.reggie.codegen.ast.ConcatNode; +import com.datadoghq.reggie.codegen.ast.ConditionalNode; +import com.datadoghq.reggie.codegen.ast.GroupNode; +import com.datadoghq.reggie.codegen.ast.LiteralNode; +import com.datadoghq.reggie.codegen.ast.QuantifierNode; +import com.datadoghq.reggie.codegen.ast.RegexNode; +import com.datadoghq.reggie.codegen.ast.RegexVisitor; +import com.datadoghq.reggie.codegen.ast.SubroutineNode; +import com.datadoghq.reggie.codegen.automaton.CharSet; +import java.util.ArrayList; +import java.util.List; + +/** + * Deterministically classifies parsed regex ASTs into reusable execution categories. + * + *

This is intentionally structural: categories are derived from AST shape and reusable semantic + * atoms, not from Grok capture names or exact pattern strings. The initial category vocabulary is + * focused on linear delimited log templates, but unsupported shapes simply classify as {@code + * GENERAL_REGEX} so normal Reggie strategy selection can continue unchanged. + */ +public final class PatternCategorizer { + + private PatternCategorizer() {} + + public static PatternCategorization categorize(RegexNode node) { + Collector collector = new Collector(); + boolean recognized = collector.collect(node); + if (!recognized) { + return new PatternCategorization( + PatternCategorization.Category.GENERAL_REGEX, + List.copyOf(collector.atoms), + List.copyOf(collector.notes)); + } + + collector.flushLiteral(); + boolean onlyLiterals = + collector.atoms.stream().allMatch(a -> a.kind() == PatternAtom.Kind.LITERAL); + return new PatternCategorization( + onlyLiterals + ? PatternCategorization.Category.LITERAL_SEQUENCE + : PatternCategorization.Category.LINEAR_TOKEN_SEQUENCE, + List.copyOf(collector.atoms), + List.copyOf(collector.notes)); + } + + private static final class Collector implements RegexVisitor { + private final List atoms = new ArrayList<>(); + private final List notes = new ArrayList<>(); + private final StringBuilder literal = new StringBuilder(); + + boolean collect(RegexNode node) { + return node.accept(this); + } + + @Override + public Boolean visitLiteral(LiteralNode node) { + literal.append(node.ch); + return true; + } + + @Override + public Boolean visitCharClass(CharClassNode node) { + flushLiteral(); + if (node.chars.equals(CharSet.WHITESPACE) && !node.negated) { + atoms.add(PatternAtom.uncaptured(PatternAtom.Kind.WHITESPACE_PLUS)); + notes.add("bare whitespace character class is categorized as a single whitespace atom"); + return true; + } + notes.add("unsupported bare character class: " + node); + return false; + } + + @Override + public Boolean visitConcat(ConcatNode node) { + for (int i = 0; i < node.children.size(); i++) { + if (isTrailingBracketedWordSearch(node.children, i)) { + flushLiteral(); + GroupNode group = (GroupNode) node.children.get(i + 3); + atoms.add( + PatternAtom.captured( + PatternAtom.Kind.BRACKETED_WORD_AFTER_SKIP, group.groupNumber, group.name)); + i += 6; + continue; + } + + RegexNode child = node.children.get(i); + if (isBracketDelimitedComplexCapture(node.children, i)) { + flushLiteral(); + GroupNode group = (GroupNode) child; + atoms.add(PatternAtom.capturedUntil(group.groupNumber, group.name, ']')); + continue; + } + + if (!collect(child)) return false; + } + return true; + } + + @Override + public Boolean visitAlternation(AlternationNode node) { + PatternAtom signedIntegerOrDash = signedIntegerOrDashAtom(node, 0, null); + if (signedIntegerOrDash != null) { + flushLiteral(); + atoms.add(signedIntegerOrDash); + return true; + } + + List optional = optionalAlternativeAtoms(node); + if (optional != null) { + flushLiteral(); + atoms.add(PatternAtom.optionalSequence(optional)); + return true; + } + + flushLiteral(); + atoms.add(PatternAtom.uncaptured(PatternAtom.Kind.COMPLEX_ALTERNATION)); + notes.add("alternation categorized as complex reusable atom"); + return true; + } + + @Override + public Boolean visitQuantifier(QuantifierNode node) { + PatternAtom atom = atomForQuantifier(node, 0, null); + if (atom == null) { + notes.add("unsupported quantifier shape: " + node); + return false; + } + flushLiteral(); + atoms.add(atom); + return true; + } + + @Override + public Boolean visitGroup(GroupNode node) { + if (!node.capturing) { + return collect(node.child); + } + PatternAtom atom = atomForGroup(node); + if (atom != null) { + flushLiteral(); + atoms.add(atom); + return true; + } + notes.add("capturing group is not a recognized linear atom: " + node); + return false; + } + + @Override + public Boolean visitAnchor(AnchorNode node) { + flushLiteral(); + atoms.add(PatternAtom.uncaptured(PatternAtom.Kind.ANCHOR)); + return true; + } + + @Override + public Boolean visitBackreference(BackreferenceNode node) { + notes.add("backreference is not linear-token-sequence categorizable"); + return false; + } + + @Override + public Boolean visitAssertion(AssertionNode node) { + notes.add("lookaround assertion is not linear-token-sequence categorizable yet"); + return false; + } + + @Override + public Boolean visitSubroutine(SubroutineNode node) { + notes.add("subroutine is not linear-token-sequence categorizable"); + return false; + } + + @Override + public Boolean visitConditional(ConditionalNode node) { + notes.add("conditional is not linear-token-sequence categorizable"); + return false; + } + + @Override + public Boolean visitBranchReset(BranchResetNode node) { + notes.add("branch-reset group is not linear-token-sequence categorizable"); + return false; + } + + void flushLiteral() { + if (literal.length() > 0) { + atoms.add(PatternAtom.literal(literal.toString())); + literal.setLength(0); + } + } + + private static PatternAtom atomForGroup(GroupNode node) { + int groupNumber = node.capturing ? node.groupNumber : 0; + String groupName = node.name; + RegexNode child = stripNonCapturingGroup(node.child); + + if (child instanceof QuantifierNode quantifier) { + return atomForQuantifier(quantifier, groupNumber, groupName); + } + if (isWordBoundaryWordBoundary(child)) { + return PatternAtom.captured(PatternAtom.Kind.WORD, groupNumber, groupName); + } + if (isSignedInteger(child)) { + return PatternAtom.captured(PatternAtom.Kind.SIGNED_INTEGER, groupNumber, groupName); + } + if (isDecimalNumber(child)) { + return PatternAtom.captured(PatternAtom.Kind.DECIMAL_NUMBER, groupNumber, groupName); + } + if (isSignedDecimalNumber(child)) { + return PatternAtom.captured(PatternAtom.Kind.SIGNED_DECIMAL_NUMBER, groupNumber, groupName); + } + if (child instanceof AlternationNode alternation) { + PatternAtom signedIntegerOrDash = + signedIntegerOrDashAtom(alternation, groupNumber, groupName); + if (signedIntegerOrDash != null) return signedIntegerOrDash; + if (isIpOrHostAlternation(alternation)) { + return PatternAtom.captured(PatternAtom.Kind.IP_OR_HOST, groupNumber, groupName); + } + return PatternAtom.captured(PatternAtom.Kind.COMPLEX_ALTERNATION, groupNumber, groupName); + } + return null; + } + + private static PatternAtom atomForQuantifier( + QuantifierNode node, int groupNumber, String groupName) { + if (!node.greedy) return null; + RegexNode child = stripNonCapturingGroup(node.child); + if (node.min == 1 && node.max == -1 && child instanceof CharClassNode charClass) { + if (charClass.chars.equals(CharSet.WHITESPACE) && charClass.negated) { + return PatternAtom.captured(PatternAtom.Kind.NON_SPACE_PLUS, groupNumber, groupName); + } + if (charClass.chars.equals(CharSet.WHITESPACE) && !charClass.negated) { + return PatternAtom.captured(PatternAtom.Kind.WHITESPACE_PLUS, groupNumber, groupName); + } + if (charClass.chars.equals(CharSet.DIGIT) && !charClass.negated) { + return PatternAtom.captured(PatternAtom.Kind.DIGITS_PLUS, groupNumber, groupName); + } + if (charClass.chars.equals(CharSet.WORD) && !charClass.negated) { + return PatternAtom.captured(PatternAtom.Kind.WORD, groupNumber, groupName); + } + if (isIpOrHostCharClass(charClass)) { + return PatternAtom.captured(PatternAtom.Kind.IP_OR_HOST, groupNumber, groupName); + } + Character delimiter = singleNegatedDelimiter(charClass); + if (delimiter != null) { + return PatternAtom.capturedUntil(groupNumber, groupName, delimiter); + } + } + if (node.min == 0 && node.max == -1) { + if (child instanceof CharClassNode charClass) { + Character delimiter = singleNegatedDelimiter(charClass); + if (delimiter != null) { + return PatternAtom.capturedUntil(groupNumber, groupName, delimiter); + } + if ((charClass.chars.equals(CharSet.ANY) + || charClass.chars.equals(CharSet.ANY_EXCEPT_NEWLINE)) + && !charClass.negated) { + return PatternAtom.captured(PatternAtom.Kind.ANY_STAR, groupNumber, groupName); + } + } + } + return null; + } + + private static List optionalAlternativeAtoms(AlternationNode node) { + if (node.alternatives.size() != 2) return null; + RegexNode left = node.alternatives.get(0); + RegexNode right = node.alternatives.get(1); + RegexNode present; + if (isEmptyAlternative(left)) { + present = right; + } else if (isEmptyAlternative(right)) { + present = left; + } else { + return null; + } + Collector nested = new Collector(); + if (!nested.collect(present)) return null; + nested.flushLiteral(); + return nested.atoms; + } + + private static PatternAtom signedIntegerOrDashAtom( + AlternationNode node, int groupNumber, String groupName) { + if (node.alternatives.size() != 2) return null; + boolean hasDash = false; + boolean hasInteger = false; + int capturedGroupNumber = groupNumber; + String capturedGroupName = groupName; + for (RegexNode alternative : node.alternatives) { + RegexNode child = stripNonCapturingGroup(alternative); + if (child instanceof LiteralNode literal && literal.ch == '-') { + hasDash = true; + } else if (child instanceof GroupNode group + && isSignedInteger(stripNonCapturingGroup(group.child))) { + hasInteger = true; + capturedGroupNumber = group.groupNumber; + capturedGroupName = group.name; + } else if (isSignedInteger(child)) { + hasInteger = true; + } + } + if (!hasDash || !hasInteger) return null; + PatternAtom.Kind kind = + groupNumber > 0 + ? PatternAtom.Kind.SIGNED_INTEGER_OR_DASH + : PatternAtom.Kind.SIGNED_INTEGER_OR_UNCAPTURED_DASH; + return PatternAtom.captured(kind, capturedGroupNumber, capturedGroupName); + } + + private static boolean isEmptyAlternative(RegexNode node) { + if (node instanceof LiteralNode literal) return literal.ch == 0; + return node instanceof ConcatNode concat && concat.children.isEmpty(); + } + + private static boolean isBracketDelimitedComplexCapture(List children, int index) { + if (index == 0 || index + 1 >= children.size()) return false; + return children.get(index) instanceof GroupNode group + && group.capturing + && atomForGroup(group) == null + && children.get(index - 1) instanceof LiteralNode open + && open.ch == '[' + && children.get(index + 1) instanceof LiteralNode close + && close.ch == ']' + && !containsBacktrackingControl(group.child); + } + + private static boolean isTrailingBracketedWordSearch(List children, int index) { + if (index + 6 >= children.size()) return false; + return isAnyStar(children.get(index)) + && children.get(index + 1) instanceof LiteralNode spaceBefore + && spaceBefore.ch == ' ' + && children.get(index + 2) instanceof LiteralNode open + && open.ch == '[' + && children.get(index + 3) instanceof GroupNode group + && group.capturing + && isWordBoundaryWordBoundary(stripNonCapturingGroup(group.child)) + && children.get(index + 4) instanceof LiteralNode close + && close.ch == ']' + && children.get(index + 5) instanceof LiteralNode spaceAfter + && spaceAfter.ch == ' ' + && isAnyStar(children.get(index + 6)); + } + + private static boolean containsBacktrackingControl(RegexNode node) { + return node.accept( + new RegexVisitor() { + @Override + public Boolean visitLiteral(LiteralNode node) { + return false; + } + + @Override + public Boolean visitCharClass(CharClassNode node) { + return false; + } + + @Override + public Boolean visitConcat(ConcatNode node) { + return node.children.stream().anyMatch(child -> child.accept(this)); + } + + @Override + public Boolean visitAlternation(AlternationNode node) { + return node.alternatives.stream().anyMatch(child -> child.accept(this)); + } + + @Override + public Boolean visitQuantifier(QuantifierNode node) { + return !node.greedy || node.child.accept(this); + } + + @Override + public Boolean visitGroup(GroupNode node) { + return node.child.accept(this); + } + + @Override + public Boolean visitAnchor(AnchorNode node) { + return false; + } + + @Override + public Boolean visitBackreference(BackreferenceNode node) { + return true; + } + + @Override + public Boolean visitAssertion(AssertionNode node) { + return true; + } + + @Override + public Boolean visitSubroutine(SubroutineNode node) { + return true; + } + + @Override + public Boolean visitConditional(ConditionalNode node) { + return true; + } + + @Override + public Boolean visitBranchReset(BranchResetNode node) { + return true; + } + }); + } + + private static boolean isAnyStar(RegexNode node) { + if (!(node instanceof QuantifierNode quantifier) + || quantifier.min != 0 + || quantifier.max != -1 + || !quantifier.greedy + || !(quantifier.child instanceof CharClassNode charClass) + || charClass.negated) { + return false; + } + return charClass.chars.equals(CharSet.ANY) + || charClass.chars.equals(CharSet.ANY_EXCEPT_NEWLINE); + } + + private static RegexNode stripNonCapturingGroup(RegexNode node) { + while (node instanceof GroupNode group && !group.capturing) { + node = group.child; + } + return node; + } + + private static Character singleNegatedDelimiter(CharClassNode node) { + if (!node.negated || !node.chars.isSingleChar()) return null; + return node.chars.getSingleChar(); + } + + private static boolean isIpOrHostCharClass(CharClassNode node) { + return !node.negated + && node.chars.contains('0') + && node.chars.contains('9') + && node.chars.contains('A') + && node.chars.contains('F') + && node.chars.contains('a') + && node.chars.contains('f') + && node.chars.contains('.') + && node.chars.contains(':'); + } + + private static boolean isIpOrHostAlternation(AlternationNode node) { + return containsIpLikeShape(node) && containsHostLikeShape(node); + } + + private static boolean containsIpLikeShape(RegexNode node) { + if (isIpLikeAlternative(node) || isHexColonHeavy(node)) return true; + if (node instanceof AlternationNode alternation) { + return alternation.alternatives.stream().anyMatch(Collector::containsIpLikeShape); + } + if (node instanceof ConcatNode concat) { + return concat.children.stream().anyMatch(Collector::containsIpLikeShape); + } + if (node instanceof GroupNode group) return containsIpLikeShape(group.child); + if (node instanceof QuantifierNode quantifier) return containsIpLikeShape(quantifier.child); + return false; + } + + private static boolean containsHostLikeShape(RegexNode node) { + if (isHostLikeAlternative(node) || isHostnameLabelSequence(node)) return true; + if (node instanceof AlternationNode alternation) { + return alternation.alternatives.stream().anyMatch(Collector::containsHostLikeShape); + } + if (node instanceof ConcatNode concat) { + return concat.children.stream().anyMatch(Collector::containsHostLikeShape); + } + if (node instanceof GroupNode group) return containsHostLikeShape(group.child); + if (node instanceof QuantifierNode quantifier) return containsHostLikeShape(quantifier.child); + return false; + } + + private static boolean isIpLikeAlternative(RegexNode node) { + if (!(node instanceof ConcatNode concat) || concat.children.size() != 2) return false; + RegexNode repeatedOctet = stripNonCapturingGroup(concat.children.get(0)); + return repeatedOctet instanceof QuantifierNode quantifier + && quantifier.min == 3 + && quantifier.max == 3 + && stripNonCapturingGroup(quantifier.child) instanceof ConcatNode octetWithDot + && octetWithDot.children.size() == 2 + && isDigitRepeat(octetWithDot.children.get(0), 1, 3) + && octetWithDot.children.get(1) instanceof LiteralNode dot + && dot.ch == '.' + && isDigitRepeat(concat.children.get(1), 1, 3); + } + + private static boolean isHostLikeAlternative(RegexNode node) { + if (!(node instanceof QuantifierNode quantifier) + || quantifier.min != 1 + || quantifier.max != -1 + || !(quantifier.child instanceof CharClassNode charClass) + || charClass.negated) { + return false; + } + return charClass.chars.contains('a') + && charClass.chars.contains('z') + && charClass.chars.contains('A') + && charClass.chars.contains('Z') + && charClass.chars.contains('0') + && charClass.chars.contains('9') + && charClass.chars.contains('.') + && charClass.chars.contains('-'); + } + + private static boolean isHexColonHeavy(RegexNode node) { + Counter counter = new Counter(); + countHexColonSignals(node, counter); + return counter.hexClasses >= 1 && counter.colons >= 2; + } + + private static boolean isHostnameLabelSequence(RegexNode node) { + Counter counter = new Counter(); + countHostnameSignals(node, counter); + return counter.wordBoundaries >= 1 && counter.hostnameClasses >= 2; + } + + private static void countHexColonSignals(RegexNode node, Counter counter) { + if (node instanceof LiteralNode literal) { + if (literal.ch == ':') counter.colons++; + } else if (node instanceof CharClassNode charClass) { + if (!charClass.negated + && charClass.chars.contains('0') + && charClass.chars.contains('9') + && charClass.chars.contains('A') + && charClass.chars.contains('F') + && charClass.chars.contains('a') + && charClass.chars.contains('f')) { + counter.hexClasses++; + } + } else if (node instanceof ConcatNode concat) { + concat.children.forEach(child -> countHexColonSignals(child, counter)); + } else if (node instanceof AlternationNode alternation) { + alternation.alternatives.forEach(child -> countHexColonSignals(child, counter)); + } else if (node instanceof GroupNode group) { + countHexColonSignals(group.child, counter); + } else if (node instanceof QuantifierNode quantifier) { + countHexColonSignals(quantifier.child, counter); + } + } + + private static void countHostnameSignals(RegexNode node, Counter counter) { + if (node instanceof AnchorNode anchor && anchor.type == AnchorNode.Type.WORD_BOUNDARY) { + counter.wordBoundaries++; + } else if (node instanceof CharClassNode charClass) { + if (!charClass.negated + && charClass.chars.contains('0') + && charClass.chars.contains('9') + && charClass.chars.contains('A') + && charClass.chars.contains('Z') + && charClass.chars.contains('a') + && charClass.chars.contains('z')) { + counter.hostnameClasses++; + } + } else if (node instanceof ConcatNode concat) { + concat.children.forEach(child -> countHostnameSignals(child, counter)); + } else if (node instanceof AlternationNode alternation) { + alternation.alternatives.forEach(child -> countHostnameSignals(child, counter)); + } else if (node instanceof GroupNode group) { + countHostnameSignals(group.child, counter); + } else if (node instanceof QuantifierNode quantifier) { + countHostnameSignals(quantifier.child, counter); + } + } + + private static final class Counter { + int hexClasses; + int colons; + int hostnameClasses; + int wordBoundaries; + } + + private static boolean isDigitRepeat(RegexNode node, int min, int max) { + return node instanceof QuantifierNode quantifier + && quantifier.min == min + && quantifier.max == max + && quantifier.child instanceof CharClassNode charClass + && !charClass.negated + && charClass.chars.equals(CharSet.DIGIT); + } + + private static boolean isWordBoundaryWordBoundary(RegexNode node) { + if (!(node instanceof ConcatNode concat) || concat.children.size() != 3) return false; + return isWordBoundary(concat.children.get(0)) + && isWordPlus(concat.children.get(1)) + && isWordBoundary(concat.children.get(2)); + } + + private static boolean isWordBoundary(RegexNode node) { + return node instanceof AnchorNode anchor && anchor.type == AnchorNode.Type.WORD_BOUNDARY; + } + + private static boolean isWordPlus(RegexNode node) { + return node instanceof QuantifierNode quantifier + && quantifier.min == 1 + && quantifier.max == -1 + && quantifier.child instanceof CharClassNode charClass + && charClass.chars.equals(CharSet.WORD) + && !charClass.negated; + } + + private static boolean isSignedInteger(RegexNode node) { + if (!(node instanceof ConcatNode concat) || concat.children.size() != 2) return false; + return isOptionalSign(concat.children.get(0)) && isDigitPlus(concat.children.get(1)); + } + + private static boolean isDecimalNumber(RegexNode node) { + if (!(node instanceof ConcatNode concat) || concat.children.size() != 3) return false; + return isDigitPlus(concat.children.get(0)) + && concat.children.get(1) instanceof LiteralNode literal + && literal.ch == '.' + && isDigitPlus(concat.children.get(2)); + } + + private static boolean isSignedDecimalNumber(RegexNode node) { + if (!(node instanceof ConcatNode concat)) return false; + if (concat.children.size() == 3) { + return isOptionalSign(concat.children.get(0)) + && isDigitPlus(concat.children.get(1)) + && isOptionalDotDigits(concat.children.get(2)); + } + return concat.children.size() == 2 + && isOptionalSign(concat.children.get(0)) + && isDecimalAlternation(stripNonCapturingGroup(concat.children.get(1))); + } + + private static boolean isDecimalAlternation(RegexNode node) { + if (!(node instanceof AlternationNode alternation)) return false; + return alternation.alternatives.stream().anyMatch(Collector::startsWithDigitPlus) + && alternation.alternatives.stream().anyMatch(Collector::startsWithDotDigitPlus); + } + + private static boolean startsWithDigitPlus(RegexNode node) { + RegexNode child = stripNonCapturingGroup(node); + return child instanceof ConcatNode concat + && !concat.children.isEmpty() + && isDigitPlus(concat.children.get(0)); + } + + private static boolean startsWithDotDigitPlus(RegexNode node) { + RegexNode child = stripNonCapturingGroup(node); + return child instanceof ConcatNode concat + && concat.children.size() == 2 + && concat.children.get(0) instanceof LiteralNode literal + && literal.ch == '.' + && isDigitPlus(concat.children.get(1)); + } + + private static boolean isOptionalSign(RegexNode node) { + return node instanceof QuantifierNode quantifier + && quantifier.min == 0 + && quantifier.max == 1 + && quantifier.child instanceof CharClassNode charClass + && !charClass.negated + && charClass.chars.contains('+') + && charClass.chars.contains('-'); + } + + private static boolean isDigitPlus(RegexNode node) { + return node instanceof QuantifierNode quantifier + && quantifier.min == 1 + && quantifier.max == -1 + && quantifier.child instanceof CharClassNode charClass + && !charClass.negated + && charClass.chars.equals(CharSet.DIGIT); + } + + private static boolean isOptionalDotDigits(RegexNode node) { + if (!(node instanceof QuantifierNode quantifier) + || quantifier.min != 0 + || quantifier.max != 1) { + return false; + } + RegexNode child = stripNonCapturingGroup(quantifier.child); + if (!(child instanceof ConcatNode concat) || concat.children.size() != 2) return false; + return concat.children.get(0) instanceof LiteralNode literal + && literal.ch == '.' + && isDigitPlus(concat.children.get(1)); + } + } +} diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/StructuralHash.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/StructuralHash.java index f0470fb..e7c8f4a 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/StructuralHash.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/StructuralHash.java @@ -17,6 +17,7 @@ import com.datadoghq.reggie.codegen.automaton.DFA; import com.datadoghq.reggie.codegen.automaton.NFA; +import java.util.EnumSet; /** * Computes structural hash for pattern analysis results to enable bytecode caching. @@ -27,6 +28,17 @@ * *

The hash does NOT include: - Pattern string (generated identifiers) - Class names or UUIDs * (generated identifiers) + * + *

Implementation note on hash width: all public methods return {@code long} (64-bit). The + * Level-2 structural cache key must be 64-bit to make birthday collisions essentially impossible + * for realistic pattern sets (~10 k patterns → P_collision ≈ 2.7 × 10⁻⁹). Using an {@code int} key + * was observed to cause silent misidentification of structurally distinct patterns as cache hits, + * producing wrong match results. + * + *

Implementation note on enum hashing: all enum values are hashed via {@link + * Enum#ordinal()} rather than {@link Object#hashCode()} because {@code hashCode()} delegates to + * {@code System.identityHashCode()}, which is not guaranteed to be non-zero. A zero identity hash + * makes a non-empty {@link EnumSet} indistinguishable from an empty one. */ public final class StructuralHash { @@ -40,44 +52,59 @@ private StructuralHash() { * @param result Analysis result containing strategy and DFA * @param nfa NFA containing group count * @param caseInsensitive Whether backreferences use case-insensitive comparison - * @return Hash code representing the structural equivalence class + * @return 64-bit hash representing the structural equivalence class */ - public static int compute( + public static long compute( PatternAnalyzer.MatchingStrategyResult result, NFA nfa, boolean caseInsensitive) { - int hash = 17; + long hash = 17L; + + hash = 31L * hash + result.strategy.ordinal(); + hash = 31L * hash + nfa.getGroupCount(); + hash = 31L * hash + (result.useTaggedDFA ? 1 : 0); + hash = 31L * hash + (result.usePosixLastMatch ? 1 : 0); + hash = 31L * hash + (caseInsensitive ? 1 : 0); + + // Each anchor type is a separate bit so that patterns differing only in which + // anchor they use always produce different hashes. + hash = 31L * hash + (nfa.hasEndAnchor() ? 1 : 0); // $ + hash = 31L * hash + (nfa.hasStartAnchor() ? 1 : 0); // ^ + hash = 31L * hash + (nfa.hasStringEndAbsoluteAnchor() ? 1 : 0); // \z + hash = 31L * hash + (nfa.hasStringEndAnchor() ? 1 : 0); // \Z + hash = 31L * hash + (nfa.hasStringStartAnchor() ? 1 : 0); // \A + hash = 31L * hash + (nfa.hasMultilineStartAnchor() ? 1 : 0); // ^ in (?m) + hash = 31L * hash + (nfa.hasMultilineEndAnchor() ? 1 : 0); // $ in (?m) - // Strategy is the primary discriminator - hash = 31 * hash + result.strategy.hashCode(); + if (result.dfa != null) { + hash = 31L * hash + computeDFATopologyHash(result.dfa); + } - // Group count affects bytecode structure (memory allocation, array sizes) - hash = 31 * hash + nfa.getGroupCount(); + hash = 31L * hash + nfa.contentHashCode(); - // Feature flags affect code generation - hash = 31 * hash + (result.useTaggedDFA ? 1 : 0); - hash = 31 * hash + (result.usePosixLastMatch ? 1 : 0); + if (result.patternInfo != null) { + hash = 31L * hash + result.patternInfo.structuralHashCode(); + } - // Case-insensitive backreference comparison affects bytecode generation - hash = 31 * hash + (caseInsensitive ? 1 : 0); + return hash; + } - // Anchor types affect bytecode (different end-of-string semantics) - // \z (STRING_END_ABSOLUTE) matches only at absolute end - // \Z (STRING_END) matches at end or before final newline - hash = 31 * hash + (nfa.hasStringEndAbsoluteAnchor() ? 1 : 0); - hash = 31 * hash + (nfa.hasStringEndAnchor() ? 1 : 0); - hash = 31 * hash + (nfa.hasStringStartAnchor() ? 1 : 0); + /** + * Compute hash without NFA (for patterns that skip NFA construction, e.g. RECURSIVE_DESCENT). + * + * @param result Analysis result + * @return 64-bit hash representing the structural equivalence class + */ + public static long computeWithoutGroupCount(PatternAnalyzer.MatchingStrategyResult result) { + long hash = 17L; + hash = 31L * hash + result.strategy.ordinal(); + hash = 31L * hash + (result.useTaggedDFA ? 1 : 0); + hash = 31L * hash + (result.usePosixLastMatch ? 1 : 0); - // DFA content hash (if applicable) if (result.dfa != null) { - hash = 31 * hash + computeDFATopologyHash(result.dfa); + hash = 31L * hash + computeDFATopologyHash(result.dfa); } - // NFA content hash - includes character sets which are critical for - // distinguishing patterns with different case-sensitivity - hash = 31 * hash + nfa.contentHashCode(); - - // PatternInfo provides structural hash including class type if (result.patternInfo != null) { - hash = 31 * hash + result.patternInfo.structuralHashCode(); + hash = 31L * hash + result.patternInfo.structuralHashCode(); } return hash; @@ -86,74 +113,48 @@ public static int compute( /** * Compute hash of DFA including topology and content. * - *

Includes: - State count - Transition counts per state - Accept state count - Max out-degree - * (transition density) - Character sets on transitions (critical for case-sensitivity) + *

Includes: state count, transition counts, accept state count, max out-degree, character sets + * on transitions, per-state acceptance anchor conditions, per-transition entry guards. * - *

Excludes: - State IDs (generated identifiers) + *

Excludes: state IDs (generated identifiers). */ - private static int computeDFATopologyHash(DFA dfa) { - int hash = 1; - - // State count is a primary structural characteristic - hash = 31 * hash + dfa.getStateCount(); + private static long computeDFATopologyHash(DFA dfa) { + long hash = 1L; - // Accept state count - hash = 31 * hash + dfa.getAcceptStates().size(); + hash = 31L * hash + dfa.getStateCount(); + hash = 31L * hash + dfa.getAcceptStates().size(); + hash = 31L * hash + dfa.getMaxOutDegree(); - // Max out-degree (affects code generation strategy: unrolled vs switch vs table) - hash = 31 * hash + dfa.getMaxOutDegree(); - - // Transition structure including character sets - // Character sets MUST be included to distinguish patterns with different - // case-sensitivity (e.g., "(ab)c" vs "(a(?i)b)c") for (DFA.DFAState state : dfa.getAllStates()) { - // Number of outgoing transitions - hash = 31 * hash + state.transitions.size(); + hash = 31L * hash + state.transitions.size(); + hash = 31L * hash + (state.accepting ? 1 : 0); - // Accepting state flag - hash = 31 * hash + (state.accepting ? 1 : 0); + // Acceptance anchor conditions: use ordinal bitmask, not EnumSet.hashCode(), because + // System.identityHashCode() can return 0, making {END} look the same as {}. + hash = 31L * hash + anchorBitmask(state.acceptanceAnchorConditions); - // Group action count (affects bytecode for group tracking) - hash = 31 * hash + state.groupActions.size(); + hash = 31L * hash + state.groupActions.size(); - // Assertion checks — include type to distinguish positive from negative assertions - hash = 31 * hash + state.assertionChecks.size(); + hash = 31L * hash + state.assertionChecks.size(); for (var ac : state.assertionChecks) { - hash = 31 * hash + ac.type.hashCode(); + hash = 31L * hash + ac.type.ordinal(); } - // Include character sets - critical for case sensitivity for (var entry : state.transitions.entrySet()) { - hash = 31 * hash + entry.getKey().hashCode(); + hash = 31L * hash + entry.getKey().hashCode(); + hash = 31L * hash + anchorBitmask(entry.getValue().entryGuard); } } return hash; } - /** - * Computes a stable hash for a pattern analysis result. - * - *

This is a convenience method for testing and debugging that doesn't require an NFA. For - * production use, prefer compute(result, nfa) which includes group count. - * - * @param result Analysis result - * @return Hash code (less accurate without group count) - */ - public static int computeWithoutGroupCount(PatternAnalyzer.MatchingStrategyResult result) { - int hash = 17; - hash = 31 * hash + result.strategy.hashCode(); - hash = 31 * hash + (result.useTaggedDFA ? 1 : 0); - hash = 31 * hash + (result.usePosixLastMatch ? 1 : 0); - - if (result.dfa != null) { - hash = 31 * hash + computeDFATopologyHash(result.dfa); - } - - if (result.patternInfo != null) { - hash = 31 * hash + result.patternInfo.structuralHashCode(); + /** Stable bitmask over an EnumSet of AnchorType using ordinal() values. */ + private static int anchorBitmask(EnumSet anchors) { + int mask = 0; + for (NFA.AnchorType a : anchors) { + mask |= (1 << a.ordinal()); } - - return hash; + return mask; } } diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFA.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFA.java index 39f9025..37618ca 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFA.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFA.java @@ -26,10 +26,31 @@ public final class DFA { private final Set acceptStates; private final List allStates; + /** + * True when subset construction detected an anchor-condition dilution: two contributors in the + * same partition slice (or two accept states) had differing anchor conditions, and intersection + * collapsed them to unconditional. The DFA is structurally valid but may accept inputs that a + * correctly-anchored automaton would reject. Callers should route to a non-DFA engine. + */ + private final boolean anchorConditionDiluted; + public DFA(DFAState startState, Set acceptStates, List allStates) { + this(startState, acceptStates, allStates, false); + } + + public DFA( + DFAState startState, + Set acceptStates, + List allStates, + boolean anchorConditionDiluted) { this.startState = startState; this.acceptStates = acceptStates; this.allStates = allStates; + this.anchorConditionDiluted = anchorConditionDiluted; + } + + public boolean isAnchorConditionDiluted() { + return anchorConditionDiluted; } public DFAState getStartState() { @@ -166,18 +187,36 @@ public static final class DFATransition { public final DFAState target; public final List tagOps; // Tag operations to perform on this transition + /** + * Anchor preconditions that must hold at the *source* position (before consuming a character). + * Empty = unconditional. Populated when the contributing NFA path crosses a START-class anchor + * (START / STRING_START / START_MULTILINE) before consuming. Codegen emits a position guard. + */ + public final EnumSet entryGuard; + public DFATransition(DFAState target) { - this(target, Collections.emptyList()); + this(target, Collections.emptyList(), EnumSet.noneOf(NFA.AnchorType.class)); } public DFATransition(DFAState target, List tagOps) { + this(target, tagOps, EnumSet.noneOf(NFA.AnchorType.class)); + } + + public DFATransition( + DFAState target, List tagOps, EnumSet entryGuard) { this.target = target; this.tagOps = tagOps; + this.entryGuard = entryGuard; } @Override public String toString() { - return "Transition{target=" + target.id + ", tags=" + tagOps.size() + "}"; + return "Transition{target=" + + target.id + + ", tags=" + + tagOps.size() + + (entryGuard.isEmpty() ? "" : ", guard=" + entryGuard) + + "}"; } } @@ -191,6 +230,16 @@ public static final class DFAState { assertionChecks; // Assertions to check in this state (prototype) public final List groupActions; // Group capture actions when entering this state + /** + * Anchor preconditions that must hold *at the current input position* for this state to be + * considered accepting. Empty = unconditional (existing {@link #accepting} flag semantics). + * Populated when the only paths to an NFA accept state cross END-class anchors (END / + * STRING_END / STRING_END_ABSOLUTE / END_MULTILINE) or START-class anchors (START / + * STRING_START / START_MULTILINE). Codegen emits the corresponding position check before + * accepting. + */ + public final EnumSet acceptanceAnchorConditions; + public DFAState(int id, Set nfaStates, boolean accepting) { this(id, nfaStates, accepting, new ArrayList<>(), new ArrayList<>()); } @@ -209,11 +258,28 @@ public DFAState( boolean accepting, List assertionChecks, List groupActions) { + this( + id, + nfaStates, + accepting, + assertionChecks, + groupActions, + EnumSet.noneOf(NFA.AnchorType.class)); + } + + public DFAState( + int id, + Set nfaStates, + boolean accepting, + List assertionChecks, + List groupActions, + EnumSet acceptanceAnchorConditions) { this.id = id; this.nfaStates = nfaStates; this.accepting = accepting; this.assertionChecks = assertionChecks; this.groupActions = groupActions; + this.acceptanceAnchorConditions = acceptanceAnchorConditions; this.transitions = new LinkedHashMap<>(); } @@ -225,6 +291,14 @@ public void addTransition(CharSet chars, DFAState target, List tag transitions.put(chars, new DFATransition(target, tagOps)); } + public void addTransition( + CharSet chars, + DFAState target, + List tagOps, + EnumSet entryGuard) { + transitions.put(chars, new DFATransition(target, tagOps, entryGuard)); + } + @Override public boolean equals(Object o) { if (this == o) return true; diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFATableData.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFATableData.java new file mode 100644 index 0000000..8c3cdf4 --- /dev/null +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFATableData.java @@ -0,0 +1,220 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.codegen.automaton; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.TreeSet; + +/** Compact table representation for large pure DFA matchers. */ +public final class DFATableData { + public final int startState; + public final int stateSlots; + public final int classCount; + public final int[] transitions; + public final boolean[] accepting; + public final int[] asciiClasses; + public final boolean[] startAscii; + public final char[] rangeStarts; + public final char[] rangeEnds; + public final int[] rangeClasses; + + private DFATableData( + int startState, + int stateSlots, + int classCount, + int[] transitions, + boolean[] accepting, + int[] asciiClasses, + boolean[] startAscii, + char[] rangeStarts, + char[] rangeEnds, + int[] rangeClasses) { + this.startState = startState; + this.stateSlots = stateSlots; + this.classCount = classCount; + this.transitions = transitions; + this.accepting = accepting; + this.asciiClasses = asciiClasses; + this.startAscii = startAscii; + this.rangeStarts = rangeStarts; + this.rangeEnds = rangeEnds; + this.rangeClasses = rangeClasses; + } + + public static DFATableData from(DFA dfa) { + int maxStateId = 0; + for (DFA.DFAState state : dfa.getAllStates()) { + maxStateId = Math.max(maxStateId, state.id); + } + int stateSlots = maxStateId + 1; + + TreeSet boundaries = new TreeSet<>(); + boundaries.add(0); + boundaries.add(0x10000); + for (DFA.DFAState state : dfa.getAllStates()) { + for (CharSet chars : state.transitions.keySet()) { + for (CharSet.Range range : chars.getRanges()) { + boundaries.add((int) range.start); + if (range.end != Character.MAX_VALUE) { + boundaries.add(((int) range.end) + 1); + } + } + } + } + + List points = new ArrayList<>(boundaries); + Map classIds = new HashMap<>(); + List classVectors = new ArrayList<>(); + List rangeStarts = new ArrayList<>(); + List rangeEnds = new ArrayList<>(); + List rangeClasses = new ArrayList<>(); + + for (int i = 0; i < points.size() - 1; i++) { + int start = points.get(i); + int endExclusive = points.get(i + 1); + if (start >= endExclusive) { + continue; + } + + char representative = (char) start; + int[] vector = new int[stateSlots]; + Arrays.fill(vector, -1); + for (DFA.DFAState state : dfa.getAllStates()) { + for (Map.Entry entry : state.transitions.entrySet()) { + if (entry.getKey().contains(representative)) { + vector[state.id] = entry.getValue().target.id; + break; + } + } + } + + VectorKey key = new VectorKey(vector); + Integer classId = classIds.get(key); + if (classId == null) { + classId = classVectors.size(); + classIds.put(key, classId); + classVectors.add(vector); + } + + char rangeStart = (char) start; + char rangeEnd = (char) (endExclusive - 1); + int last = rangeClasses.size() - 1; + if (last >= 0 + && rangeClasses.get(last).intValue() == classId + && ((int) rangeEnds.get(last)) + 1 == start) { + rangeEnds.set(last, rangeEnd); + } else { + rangeStarts.add(rangeStart); + rangeEnds.add(rangeEnd); + rangeClasses.add(classId); + } + } + + int classCount = classVectors.size(); + int[] transitions = new int[stateSlots * classCount]; + Arrays.fill(transitions, -1); + for (int classId = 0; classId < classCount; classId++) { + int[] vector = classVectors.get(classId); + for (int state = 0; state < stateSlots; state++) { + transitions[state * classCount + classId] = vector[state]; + } + } + + boolean[] accepting = new boolean[stateSlots]; + for (DFA.DFAState state : dfa.getAcceptStates()) { + accepting[state.id] = true; + } + + int[] asciiClasses = new int[128]; + boolean[] startAscii = new boolean[128]; + for (int ch = 0; ch < asciiClasses.length; ch++) { + asciiClasses[ch] = classFor((char) ch, rangeStarts, rangeEnds, rangeClasses); + startAscii[ch] = transitions[dfa.getStartState().id * classCount + asciiClasses[ch]] >= 0; + } + + return new DFATableData( + dfa.getStartState().id, + stateSlots, + classCount, + transitions, + accepting, + asciiClasses, + startAscii, + toCharArray(rangeStarts), + toCharArray(rangeEnds), + toIntArray(rangeClasses)); + } + + public int estimatedBytes() { + return transitions.length * Integer.BYTES + + accepting.length + + asciiClasses.length * Integer.BYTES + + startAscii.length + + rangeStarts.length * Character.BYTES + + rangeEnds.length * Character.BYTES + + rangeClasses.length * Integer.BYTES; + } + + private static int classFor( + char ch, List starts, List ends, List classes) { + for (int i = 0; i < starts.size(); i++) { + if (ch >= starts.get(i) && ch <= ends.get(i)) { + return classes.get(i); + } + } + return 0; + } + + private static char[] toCharArray(List values) { + char[] result = new char[values.size()]; + for (int i = 0; i < values.size(); i++) { + result[i] = values.get(i); + } + return result; + } + + private static int[] toIntArray(List values) { + int[] result = new int[values.size()]; + for (int i = 0; i < values.size(); i++) { + result[i] = values.get(i); + } + return result; + } + + private static final class VectorKey { + private final int[] vector; + private final int hash; + + private VectorKey(int[] vector) { + this.vector = vector; + this.hash = Arrays.hashCode(vector); + } + + @Override + public boolean equals(Object obj) { + return obj instanceof VectorKey && Arrays.equals(vector, ((VectorKey) obj).vector); + } + + @Override + public int hashCode() { + return hash; + } + } +} diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/NFA.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/NFA.java index b552a4b..51fc2dd 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/NFA.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/NFA.java @@ -199,6 +199,12 @@ private boolean requiresAnchorOnAllPaths(AnchorType... barriers) { if (!state.getTransitions().isEmpty()) { return false; } + // If accept is reachable without crossing the barrier (e.g. via END anchor for bare `$`), + // the pattern can match at positions other than the start, so the find()-loop "only try + // tryPos == 0" optimization is unsound. + if (acceptStates.contains(state)) { + return false; + } for (NFAState next : state.getEpsilonTransitions()) { if (visited.add(next)) { queue.add(next); @@ -300,8 +306,16 @@ public int contentHashCode() { hash = 31 * hash + (state.enterGroup != null ? state.enterGroup + 1 : 0); hash = 31 * hash + (state.exitGroup != null ? state.exitGroup + 1 : 0); - // Assertion type distinguishes (?<=...) from (?, DFA.DFAState> stateCache; private List allStates; private int nextStateId; + private boolean anchorConditionDiluted; public DFA buildDFA(NFA nfa) throws StateExplosionException { return buildDFA(nfa, false); @@ -43,62 +44,117 @@ public DFA buildDFA(NFA nfa, boolean computeTags) throws StateExplosionException this.stateCache = new HashMap<>(); this.allStates = new ArrayList<>(); this.nextStateId = 0; - - // Pre-compute epsilon closures for all NFA states - Map> epsilonClosures = precomputeEpsilonClosures(nfa); - - // Start with epsilon-closure of NFA start state - Set startClosure = epsilonClosures.get(nfa.getStartState()); - boolean startAccepting = containsAcceptState(startClosure, nfa.getAcceptStates()); - List startGroupActions = computeGroupActions(startClosure); + this.anchorConditionDiluted = false; + + // Pre-compute anchor-aware epsilon closures for all NFA states. Each entry maps a reachable + // NFA state to the weakest conjunction of anchors that must hold at the current input + // position for that state to be live. + Map>> anchoredClosures = + precomputeAnchoredClosures(nfa); + + // Start with anchored epsilon-closure of NFA start state + Map> startClosure = + anchoredClosures.get(nfa.getStartState()); + Set startClosureSet = startClosure.keySet(); + List startGroupActions = computeGroupActions(startClosureSet); + EnumSet startAcceptConditions = + computeAcceptanceConditions(startClosure, nfa.getAcceptStates()); + boolean startAccepting = + containsAcceptState(startClosureSet, nfa.getAcceptStates()) + || !startAcceptConditions.isEmpty(); DFA.DFAState start = new DFA.DFAState( - nextStateId++, startClosure, startAccepting, new ArrayList<>(), startGroupActions); - stateCache.put(startClosure, start); + nextStateId++, + startClosureSet, + startAccepting, + new ArrayList<>(), + startGroupActions, + startAcceptConditions); + stateCache.put(startClosureSet, start); allStates.add(start); Queue worklist = new ArrayDeque<>(); worklist.add(start); + // Per-DFA-state anchor conditions, mirroring DFAState.nfaStates set membership. + Map>> dfaStateConditions = + new HashMap<>(); + dfaStateConditions.put(start, startClosure); while (!worklist.isEmpty()) { DFA.DFAState current = worklist.poll(); + Map> currentConditions = + dfaStateConditions.get(current); // Compute disjoint partition of outgoing character sets List partition = computeDisjointPartition(current.nfaStates); for (CharSet chars : partition) { - // Find all NFA states reachable on this charset - Set targets = new HashSet<>(); + // Find all NFA states reachable on this charset, along with the weakest anchor + // condition required at the *source* position to take any contributing transition. + Map> targetsWithCond = new HashMap<>(); + EnumSet transitionGuard = null; // weakest across contributing sources + boolean transitionHasContributor = false; + boolean anyNonEmptySrcCond = false; for (NFA.NFAState nfaState : current.nfaStates) { + EnumSet srcCond = currentConditions.get(nfaState); + if (srcCond == null) continue; // unreachable + // END-class anchors require pos == length; they cannot gate a consuming transition. + if (containsConsumeKillingAnchor(srcCond)) continue; for (NFA.Transition trans : nfaState.getTransitions()) { if (trans.chars.intersects(chars)) { - // Add epsilon closure of target state - targets.addAll(epsilonClosures.get(trans.target)); + transitionHasContributor = true; + if (!srcCond.isEmpty()) anyNonEmptySrcCond = true; + transitionGuard = mergeWeakest(transitionGuard, srcCond); + // After consuming a char, prior conditions are discharged. The post-consume + // closure carries its own conditions starting from the transition target. + Map> postClosure = + anchoredClosures.get(trans.target); + for (Map.Entry> e : postClosure.entrySet()) { + targetsWithCond.merge( + e.getKey(), EnumSet.copyOf(e.getValue()), SubsetConstructor::mergeWeakestInto); + } } } } - if (targets.isEmpty()) continue; + if (!transitionHasContributor || targetsWithCond.isEmpty()) continue; + if (transitionGuard == null) transitionGuard = EnumSet.noneOf(NFA.AnchorType.class); + // Anchor dilution: an unconditional contributor erased a non-empty anchor guard. + if (transitionGuard.isEmpty() && anyNonEmptySrcCond) anchorConditionDiluted = true; + + Set targets = targetsWithCond.keySet(); // Get or create DFA state DFA.DFAState target = stateCache.get(targets); if (target == null) { - boolean accepting = containsAcceptState(targets, nfa.getAcceptStates()); + EnumSet targetAcceptConditions = + computeAcceptanceConditions(targetsWithCond, nfa.getAcceptStates()); + boolean accepting = + containsAcceptState(targets, nfa.getAcceptStates()) + || !targetAcceptConditions.isEmpty(); List groupActions = computeGroupActions(targets); target = - new DFA.DFAState(nextStateId++, targets, accepting, new ArrayList<>(), groupActions); + new DFA.DFAState( + nextStateId++, + targets, + accepting, + new ArrayList<>(), + groupActions, + targetAcceptConditions); stateCache.put(targets, target); allStates.add(target); + dfaStateConditions.put(target, targetsWithCond); worklist.add(target); } // Compute tag operations if requested (Tagged DFA) if (computeTags && nfa.getGroupCount() > 0) { List tagOps = - computeTagOperations(current.nfaStates, targets, chars, epsilonClosures); - current.addTransition(chars, target, tagOps); + computeTagOperations( + current.nfaStates, targets, chars, flattenClosure(anchoredClosures)); + current.addTransition(chars, target, tagOps, transitionGuard); } else { - current.addTransition(chars, target); + current.addTransition(chars, target, Collections.emptyList(), transitionGuard); } } @@ -111,7 +167,7 @@ public DFA buildDFA(NFA nfa, boolean computeTags) throws StateExplosionException Set acceptStates = allStates.stream().filter(s -> s.accepting).collect(java.util.stream.Collectors.toSet()); - return new DFA(start, acceptStates, allStates); + return new DFA(start, acceptStates, allStates, anchorConditionDiluted); } /** @@ -147,6 +203,158 @@ private void computeEpsilonClosure(NFA.NFAState start, Set closure } } + /** + * Pre-compute anchor-aware epsilon closures: for each NFA state, a map from each ε-reachable + * state to the *weakest conjunction of anchor types* that must hold at the current input position + * to live there. An empty {@link EnumSet} means unconditional reachability. + */ + private Map>> precomputeAnchoredClosures( + NFA nfa) { + Map>> closures = new HashMap<>(); + for (NFA.NFAState state : nfa.getStates()) { + closures.put(state, computeAnchoredEpsilonClosure(state)); + } + return closures; + } + + /** + * Compute the anchor-aware ε-closure from {@code start}. When a state in the BFS frontier has + * {@code anchor != null}, that anchor is added to the condition under which each ε-successor is + * reachable. Multiple paths to the same state merge to the weakest conjunction (intersection). + */ + private Map> computeAnchoredEpsilonClosure( + NFA.NFAState start) { + Map> result = new HashMap<>(); + result.put(start, EnumSet.noneOf(NFA.AnchorType.class)); + Deque worklist = new ArrayDeque<>(); + worklist.add(start); + while (!worklist.isEmpty()) { + NFA.NFAState current = worklist.poll(); + EnumSet currentCond = result.get(current); + EnumSet propagated; + if (current.anchor != null && isPositionAnchor(current.anchor)) { + propagated = EnumSet.copyOf(currentCond); + propagated.add(current.anchor); + } else { + propagated = currentCond; + } + for (NFA.NFAState target : current.getEpsilonTransitions()) { + EnumSet existing = result.get(target); + if (existing == null) { + result.put(target, EnumSet.copyOf(propagated)); + worklist.add(target); + } else { + // Weakest wins: intersection of existing and propagated. If that loosens the + // requirement, store and re-propagate. + EnumSet merged = EnumSet.copyOf(existing); + merged.retainAll(propagated); + if (!merged.equals(existing)) { + // Two non-empty but disjoint anchor sets meeting at the same state: their + // intersection is empty (unconditional), erasing both anchors. + if (merged.isEmpty() && !existing.isEmpty() && !propagated.isEmpty()) { + anchorConditionDiluted = true; + } + result.put(target, merged); + worklist.add(target); + } + } + } + } + return result; + } + + /** + * Returns true if the given anchor type is one this fix knows how to gate at the DFA level. Word + * boundaries and reset-match anchors are handled elsewhere; they are not treated as positional + * gating here. + */ + private static boolean isPositionAnchor(NFA.AnchorType type) { + switch (type) { + case START: + case END: + case START_MULTILINE: + case END_MULTILINE: + case STRING_START: + case STRING_END: + case STRING_END_ABSOLUTE: + return true; + case WORD_BOUNDARY: + case RESET_MATCH: + default: + return false; + } + } + + /** + * Returns true if any anchor in the set requires {@code pos == length} (or near-end), which makes + * a consuming char-transition impossible. Used to prune dead transitions. + */ + private static boolean containsConsumeKillingAnchor(EnumSet conds) { + return conds.contains(NFA.AnchorType.END) || conds.contains(NFA.AnchorType.STRING_END_ABSOLUTE); + // Note: STRING_END (\Z) and END_MULTILINE allow consuming a final newline, but precise + // handling there would require char-set intersection. Conservative pruning is safe for + // the present scope; an extension can refine if needed. + } + + /** + * Compute weakest acceptance conditions across all accept NFA states in {@code closure}. Returns + * an empty set if any accept state is unconditionally reachable; otherwise the weakest + * single-conjunction condition. Callers treat empty as "unconditionally accepting". + * + *

Side effect: sets {@link #anchorConditionDiluted} when multiple accept states have non-empty + * but disjoint conditions whose intersection collapses to empty. + */ + private EnumSet computeAcceptanceConditions( + Map> closure, Set acceptStates) { + EnumSet best = null; + for (NFA.NFAState s : closure.keySet()) { + if (!acceptStates.contains(s)) continue; + EnumSet cond = closure.get(s); + if (cond.isEmpty()) return EnumSet.noneOf(NFA.AnchorType.class); + if (best == null) best = EnumSet.copyOf(cond); + else best.retainAll(cond); + } + if (best != null && best.isEmpty()) { + // All accept states had non-empty conditions, but they were disjoint — intersection + // collapsed to empty (unconditional). The DFA would accept without checking any anchor. + anchorConditionDiluted = true; + } + return best == null ? EnumSet.noneOf(NFA.AnchorType.class) : best; + } + + /** Merge two weakest-condition values via intersection. */ + private static EnumSet mergeWeakest( + EnumSet a, EnumSet b) { + if (a == null) return EnumSet.copyOf(b); + if (b == null) return a; + EnumSet r = EnumSet.copyOf(a); + r.retainAll(b); + return r; + } + + /** {@link Map#merge} remapping function for weakest-condition merging. */ + private static EnumSet mergeWeakestInto( + EnumSet existing, EnumSet incoming) { + EnumSet r = EnumSet.copyOf(existing); + r.retainAll(incoming); + return r; + } + + /** + * Flatten anchored-closure data structure back to the legacy {@code Map>} + * shape consumed by tag-operation computation, which only cares about set membership, not anchor + * conditions. + */ + private static Map> flattenClosure( + Map>> anchored) { + Map> flat = new HashMap<>(); + for (Map.Entry>> e : + anchored.entrySet()) { + flat.put(e.getKey(), e.getValue().keySet()); + } + return flat; + } + /** * Critical algorithm: splits overlapping character sets into disjoint ranges. Example: [a-z] and * [e-m] → [a-d], [e-m], [n-z] @@ -532,26 +740,38 @@ public DFA buildDFAWithAssertions(NFA nfa) throws StateExplosionException { this.stateCache = new HashMap<>(); this.allStates = new ArrayList<>(); this.nextStateId = 0; + this.anchorConditionDiluted = false; - // Pre-compute epsilon closures for all NFA states - Map> epsilonClosures = precomputeEpsilonClosures(nfa); + // Pre-compute anchor-aware epsilon closures + Map>> anchoredClosures = + precomputeAnchoredClosures(nfa); // Extract assertions from NFA Map> assertionMap = extractAssertions(nfa); - // Start with epsilon-closure of NFA start state - Set startClosure = epsilonClosures.get(nfa.getStartState()); - boolean startAccepting = containsAcceptState(startClosure, nfa.getAcceptStates()); + // Start with anchored epsilon-closure of NFA start state + Map> startClosure = + anchoredClosures.get(nfa.getStartState()); + Set startClosureSet = startClosure.keySet(); DFA.DFAState start = - createDFAStateWithAssertions(startClosure, assertionMap, nfa.getAcceptStates()); - stateCache.put(startClosure, start); + createDFAStateWithAssertions( + startClosureSet, + assertionMap, + nfa.getAcceptStates(), + computeAcceptanceConditions(startClosure, nfa.getAcceptStates())); + stateCache.put(startClosureSet, start); allStates.add(start); Queue worklist = new ArrayDeque<>(); worklist.add(start); + Map>> dfaStateConditions = + new HashMap<>(); + dfaStateConditions.put(start, startClosure); while (!worklist.isEmpty()) { DFA.DFAState current = worklist.poll(); + Map> currentConditions = + dfaStateConditions.get(current); // State explosion check (threshold: 300 states) if (allStates.size() > 300) { @@ -562,32 +782,47 @@ public DFA buildDFAWithAssertions(NFA nfa) throws StateExplosionException { List partition = computeDisjointPartition(current.nfaStates); for (CharSet chars : partition) { - // Find all NFA states reachable on this charset - Set targets = new HashSet<>(); + Map> targetsWithCond = new HashMap<>(); + EnumSet transitionGuard = null; + boolean hasContributor = false; + boolean anyNonEmptySrcCond = false; for (NFA.NFAState nfaState : current.nfaStates) { + EnumSet srcCond = currentConditions.get(nfaState); + if (srcCond == null) continue; + if (containsConsumeKillingAnchor(srcCond)) continue; for (NFA.Transition trans : nfaState.getTransitions()) { if (trans.chars.intersects(chars)) { - // Add epsilon closure of target state - targets.addAll(epsilonClosures.get(trans.target)); + hasContributor = true; + if (!srcCond.isEmpty()) anyNonEmptySrcCond = true; + transitionGuard = mergeWeakest(transitionGuard, srcCond); + for (Map.Entry> e : + anchoredClosures.get(trans.target).entrySet()) { + targetsWithCond.merge( + e.getKey(), EnumSet.copyOf(e.getValue()), SubsetConstructor::mergeWeakestInto); + } } } } - if (!targets.isEmpty()) { - // Get or create DFA state for this target set - DFA.DFAState targetState = stateCache.get(targets); - if (targetState == null) { - boolean accepting = containsAcceptState(targets, nfa.getAcceptStates()); - targetState = - createDFAStateWithAssertions(targets, assertionMap, nfa.getAcceptStates()); - stateCache.put(targets, targetState); - allStates.add(targetState); - worklist.add(targetState); - } - - // Add transition - current.addTransition(chars, targetState); + if (!hasContributor || targetsWithCond.isEmpty()) continue; + if (transitionGuard == null) transitionGuard = EnumSet.noneOf(NFA.AnchorType.class); + if (transitionGuard.isEmpty() && anyNonEmptySrcCond) anchorConditionDiluted = true; + + Set targets = targetsWithCond.keySet(); + DFA.DFAState targetState = stateCache.get(targets); + if (targetState == null) { + targetState = + createDFAStateWithAssertions( + targets, + assertionMap, + nfa.getAcceptStates(), + computeAcceptanceConditions(targetsWithCond, nfa.getAcceptStates())); + stateCache.put(targets, targetState); + allStates.add(targetState); + dfaStateConditions.put(targetState, targetsWithCond); + worklist.add(targetState); } + current.addTransition(chars, targetState, Collections.emptyList(), transitionGuard); } } @@ -595,7 +830,7 @@ public DFA buildDFAWithAssertions(NFA nfa) throws StateExplosionException { Set acceptStates = allStates.stream().filter(s -> s.accepting).collect(java.util.stream.Collectors.toSet()); - return new DFA(start, acceptStates, allStates); + return new DFA(start, acceptStates, allStates, anchorConditionDiluted); } /** Helper class to hold assertion extraction results. */ @@ -841,17 +1076,30 @@ private DFA.DFAState createDFAStateWithAssertions( Set nfaStates, Map> assertionMap, Set acceptStates) { + return createDFAStateWithAssertions( + nfaStates, assertionMap, acceptStates, EnumSet.noneOf(NFA.AnchorType.class)); + } + + /** Create DFA state with assertion annotations, group actions, and acceptance anchor cond. */ + private DFA.DFAState createDFAStateWithAssertions( + Set nfaStates, + Map> assertionMap, + Set acceptStates, + EnumSet acceptanceAnchorConditions) { List assertions = new ArrayList<>(); List groupActions = computeGroupActions(nfaStates); + boolean accepting = + containsAcceptState(nfaStates, acceptStates) || !acceptanceAnchorConditions.isEmpty(); DFA.DFAState dfaState = new DFA.DFAState( nextStateId++, nfaStates, - containsAcceptState(nfaStates, acceptStates), + accepting, assertions, - groupActions); + groupActions, + acceptanceAnchorConditions); // Aggregate assertions from all NFA states for (NFA.NFAState nfaState : nfaStates) { diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/ThompsonBuilder.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/ThompsonBuilder.java index 1446cc0..24791b7 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/ThompsonBuilder.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/ThompsonBuilder.java @@ -211,6 +211,15 @@ private NFAFragment buildCountedQuantifier(RegexNode child, int min, int max) { NFAFragment result = fragments.get(0); Set allExits = new HashSet<>(); + // When min == 0, the first fragment is itself optional — the whole quantifier can match + // zero times by going directly from entry to exit. The chain loop below only marks + // fragments[1..] as optional (via the i >= min check, which is always true for min == 0 + // starting from i=1, but never includes fragments[0] itself), so the 0-reps path was + // missing. Without this, c{0,3} could only match 1, 2, or 3 c's, never 0. + if (min == 0) { + allExits.add(result.entry); + } + for (int i = 1; i < fragments.size(); i++) { NFAFragment next = fragments.get(i); diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/BoundedQuantifierBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/BoundedQuantifierBytecodeGenerator.java index e1ca74c..57f7584 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/BoundedQuantifierBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/BoundedQuantifierBytecodeGenerator.java @@ -917,7 +917,16 @@ private int generateBoundedQuantifierMatching( mv.visitLabel(loopEnd); - // Minimum already satisfied (matchCount >= 1), no check needed + // For min=1 the first-char guarantees count >= 1; for min > 1 verify here. + if (elem.min > 1) { + mv.visitVarInsn(ILOAD, matchCountVar); + pushInt(mv, elem.min); + Label minSatisfied = new Label(); + mv.visitJumpInsn(IF_ICMPGE, minSatisfied); + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + mv.visitLabel(minSatisfied); + } } else { // elem.min == 0 (optional quantifier like {0,3}) @@ -1045,7 +1054,12 @@ private int generateBoundedQuantifierMatchingWithFail( mv.visitLabel(loopEnd); - // Minimum already satisfied (matchCount >= 1), no check needed + // For min=1 the first-char guarantees count >= 1; for min > 1 verify here. + if (elem.min > 1) { + mv.visitVarInsn(ILOAD, matchCountVar); + pushInt(mv, elem.min); + mv.visitJumpInsn(IF_ICMPLT, matchFailed); + } } else { // elem.min == 0 (optional quantifier like {0,3}) diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFASwitchBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFASwitchBytecodeGenerator.java index b2e8482..688dcfb 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFASwitchBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFASwitchBytecodeGenerator.java @@ -22,6 +22,7 @@ import com.datadoghq.reggie.codegen.automaton.CharSet; import com.datadoghq.reggie.codegen.automaton.DFA; import com.datadoghq.reggie.codegen.automaton.NFA; +import java.util.EnumSet; import java.util.Map; import org.objectweb.asm.ClassWriter; import org.objectweb.asm.Label; @@ -213,9 +214,10 @@ public void generateMatchesMethod(ClassWriter cw, String className) { // Note: pos has been incremented, so pass posVar for assertion checking generateStateSwitch(mv, stateVar, chVar, posVar, loopStart, allocator); - // End of input - check if in accept state + // End of input - check if in accept state, including zero-width assertions attached to the + // accepting DFA state. mv.visitLabel(loopEnd); - generateAcceptCheck(mv, stateVar); + generateAcceptCheckWithAssertions(mv, stateVar, posVar, allocator); mv.visitMaxs(0, 0); mv.visitEnd(); @@ -277,7 +279,7 @@ private void generateStateCaseCode( // Check assertions BEFORE character transitions (critical for correctness) // Note: posVar is AFTER pos++ (incremented in main loop before switch) // generateAssertionCheck handles the position adjustment internally - if (!state.assertionChecks.isEmpty()) { + if (!state.accepting && !state.assertionChecks.isEmpty()) { Label assertionFailed = new Label(); Label assertionsPassed = new Label(); @@ -327,162 +329,13 @@ private void generateAssertionCheck( int posVar, Label assertionFailed, LocalVarAllocator allocator) { - if (assertion.isLookahead()) { - // Lookahead: peek forward from current position (posVar - 1) - String literal = assertion.literal; - - if (assertion.isPositive()) { - // Positive lookahead: Check all chars match - for (int i = 0; i < literal.length(); i++) { - char expectedChar = literal.charAt(i); - int peekOffset = assertion.offset + i; - - // Bounds check: if ((pos - 1) + peekOffset >= input.length()) fail - mv.visitVarInsn(ILOAD, posVar); // Load pos (already incremented) - mv.visitInsn(ICONST_M1); // Load -1 - mv.visitInsn(IADD); // pos - 1 - pushInt(mv, peekOffset); // Load peekOffset - mv.visitInsn(IADD); // (pos - 1) + peekOffset - mv.visitVarInsn(ALOAD, 1); // input - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); - mv.visitJumpInsn(IF_ICMPGE, assertionFailed); - - // Character check: if (input.charAt((pos - 1) + peekOffset) != expected) fail - mv.visitVarInsn(ALOAD, 1); // input - mv.visitVarInsn(ILOAD, posVar); // Load pos - mv.visitInsn(ICONST_M1); // Load -1 - mv.visitInsn(IADD); // pos - 1 - pushInt(mv, peekOffset); // Load peekOffset - mv.visitInsn(IADD); // (pos - 1) + peekOffset - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); - pushInt(mv, (int) expectedChar); - mv.visitJumpInsn(IF_ICMPNE, assertionFailed); - } - } else { - // Negative lookahead: Check if pattern is NOT present - // If ANY char doesn't match, assertion succeeds (skip to end) - // If ALL chars match, assertion fails - Label assertionPassed = new Label(); - - for (int i = 0; i < literal.length(); i++) { - char expectedChar = literal.charAt(i); - int peekOffset = assertion.offset + i; - - // Bounds check: if ((pos - 1) + peekOffset >= input.length()) - // Pattern not present (out of bounds), so negative assertion succeeds - mv.visitVarInsn(ILOAD, posVar); // Load pos (already incremented) - mv.visitInsn(ICONST_M1); // Load -1 - mv.visitInsn(IADD); // pos - 1 - pushInt(mv, peekOffset); // Load peekOffset - mv.visitInsn(IADD); // (pos - 1) + peekOffset - mv.visitVarInsn(ALOAD, 1); // input - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); - mv.visitJumpInsn(IF_ICMPGE, assertionPassed); - - // Character check: if (input.charAt((pos - 1) + peekOffset) != expected) - // Pattern doesn't match, so negative assertion succeeds - mv.visitVarInsn(ALOAD, 1); // input - mv.visitVarInsn(ILOAD, posVar); // Load pos - mv.visitInsn(ICONST_M1); // Load -1 - mv.visitInsn(IADD); // pos - 1 - pushInt(mv, peekOffset); // Load peekOffset - mv.visitInsn(IADD); // (pos - 1) + peekOffset - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); - pushInt(mv, (int) expectedChar); - mv.visitJumpInsn(IF_ICMPNE, assertionPassed); - } - - // All chars matched - pattern IS present, negative assertion fails - mv.visitJumpInsn(GOTO, assertionFailed); - - // Pattern not present - negative assertion succeeds - mv.visitLabel(assertionPassed); - } - } else if (assertion.isLookbehind()) { - // Lookbehind: check backward from current position (posVar - 1) - int width = assertion.width; - - // Calculate checkPos = (pos - 1) - width - int checkPosVar = allocator.allocate(); - mv.visitVarInsn(ILOAD, posVar); // Load pos (already incremented) - mv.visitInsn(ICONST_M1); // Load -1 - mv.visitInsn(IADD); // pos - 1 - pushInt(mv, width); // Load width - mv.visitInsn(ISUB); // (pos - 1) - width - mv.visitVarInsn(ISTORE, checkPosVar); - - // Bounds check: if (checkPos < 0) - mv.visitVarInsn(ILOAD, checkPosVar); - Label boundsOk = new Label(); - Label assertionPassed = new Label(); - mv.visitJumpInsn(IFGE, boundsOk); // If >= 0, bounds OK - - // Can't look back far enough - if (assertion.isPositive()) { - // Positive lookbehind: fail (required pattern not present) - mv.visitJumpInsn(GOTO, assertionFailed); - } else { - // Negative lookbehind: succeed (no pattern to match against) - // Skip character checking - assertion is satisfied - mv.visitJumpInsn(GOTO, assertionPassed); - } - - mv.visitLabel(boundsOk); - - if (assertion.isLiteral) { - // Literal assertion: use String.regionMatches() - mv.visitVarInsn(ALOAD, 1); // Load input string - mv.visitVarInsn(ILOAD, checkPosVar); // Load checkPos (toffset) - mv.visitLdcInsn(assertion.literal); // Load literal string (other) - mv.visitInsn(ICONST_0); // ooffset = 0 - pushInt(mv, assertion.literal.length()); // len - mv.visitMethodInsn( - INVOKEVIRTUAL, "java/lang/String", "regionMatches", "(ILjava/lang/String;II)Z", false); - - // Stack now has boolean result (1 = match, 0 = no match) - if (assertion.isPositive()) { - mv.visitJumpInsn(IFEQ, assertionFailed); - } else { - mv.visitJumpInsn(IFEQ, assertionPassed); - mv.visitJumpInsn(GOTO, assertionFailed); - mv.visitLabel(assertionPassed); - } - } else { - // CharSet sequence assertion - Label mismatch = new Label(); - int chVar = allocator.allocate(); - - for (int i = 0; i < assertion.charSets.size(); i++) { - mv.visitVarInsn(ALOAD, 1); // input - mv.visitVarInsn(ILOAD, checkPosVar); - if (i > 0) { - pushInt(mv, i); - mv.visitInsn(IADD); - } - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); - mv.visitVarInsn(ISTORE, chVar); - - generateCharSetCheck(mv, assertion.charSets.get(i), chVar, mismatch); - } - - // All matched - if (assertion.isPositive()) { - // Positive: success - } else { - mv.visitJumpInsn(GOTO, assertionFailed); - } - mv.visitJumpInsn(GOTO, assertionPassed); - - mv.visitLabel(mismatch); - if (assertion.isPositive()) { - mv.visitJumpInsn(GOTO, assertionFailed); - } - mv.visitLabel(assertionPassed); - } - // Positive assertion continues here (already passed) - } else { - throw new IllegalStateException("Unknown assertion type: " + assertion.type); - } + int assertionPosVar = allocator.allocate(); + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_M1); + mv.visitInsn(IADD); + mv.visitVarInsn(ISTORE, assertionPosVar); + generateAssertionCheckAtCurrentPosition( + mv, assertion, assertionPosVar, assertionFailed, allocator); } /** Generate inline character checks (same as unrolled version). */ @@ -575,6 +428,222 @@ private void generateAcceptCheck(MethodVisitor mv, int stateVar) { } } + /** Generate code to check if current state is accepting, including per-state assertions. */ + private void generateAcceptCheckWithAssertions( + MethodVisitor mv, int stateVar, int posVar, LocalVarAllocator allocator) { + Label notAccepting = new Label(); + for (DFA.DFAState acceptState : dfa.getAcceptStates()) { + Label checkNext = new Label(); + mv.visitVarInsn(ILOAD, stateVar); + pushInt(mv, acceptState.id); + mv.visitJumpInsn(IF_ICMPNE, checkNext); + + for (AssertionCheck assertion : acceptState.assertionChecks) { + generateAssertionCheckAtCurrentPosition(mv, assertion, posVar, checkNext, allocator); + } + if (!acceptState.acceptanceAnchorConditions.isEmpty()) { + emitAcceptanceAnchorChecks(mv, acceptState.acceptanceAnchorConditions, posVar, checkNext); + } + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); + mv.visitLabel(checkNext); + } + mv.visitLabel(notAccepting); + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + } + + /** + * Generate an assertion check at the current DFA position. This is used when an accepting state + * is reached between character transitions (for find/matches end checks). The existing + * generateAssertionCheck method is for transition-time checks after pos++ and therefore evaluates + * at pos-1. + */ + private void generateAssertionCheckAtCurrentPosition( + MethodVisitor mv, + AssertionCheck assertion, + int posVar, + Label assertionFailed, + LocalVarAllocator allocator) { + if (assertion.isLookahead()) { + if (assertion.isLiteral) { + generateLiteralLookaheadAtCurrentPosition( + mv, assertion, posVar, assertionFailed, allocator); + } else { + generateCharSetLookaheadAtCurrentPosition( + mv, assertion, posVar, assertionFailed, allocator); + } + } else if (assertion.isLookbehind()) { + generateLookbehindAtCurrentPosition(mv, assertion, posVar, assertionFailed, allocator); + } else { + throw new IllegalStateException("Unknown assertion type: " + assertion.type); + } + } + + private void generateLiteralLookaheadAtCurrentPosition( + MethodVisitor mv, + AssertionCheck assertion, + int posVar, + Label assertionFailed, + LocalVarAllocator allocator) { + String literal = assertion.literal; + if (assertion.isPositive()) { + mv.visitVarInsn(ILOAD, posVar); + pushInt(mv, assertion.offset + literal.length()); + mv.visitInsn(IADD); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitJumpInsn(IF_ICMPGT, assertionFailed); + + int checkPosVar = allocator.allocate(); + for (int i = 0; i < literal.length(); i++) { + mv.visitVarInsn(ILOAD, posVar); + pushInt(mv, assertion.offset + i); + mv.visitInsn(IADD); + mv.visitVarInsn(ISTORE, checkPosVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, checkPosVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + pushInt(mv, literal.charAt(i)); + mv.visitJumpInsn(IF_ICMPNE, assertionFailed); + } + } else { + Label assertionPassed = new Label(); + int checkPosVar = allocator.allocate(); + for (int i = 0; i < literal.length(); i++) { + mv.visitVarInsn(ILOAD, posVar); + pushInt(mv, assertion.offset + i); + mv.visitInsn(IADD); + mv.visitVarInsn(ISTORE, checkPosVar); + mv.visitVarInsn(ILOAD, checkPosVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitJumpInsn(IF_ICMPGE, assertionPassed); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, checkPosVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + pushInt(mv, literal.charAt(i)); + mv.visitJumpInsn(IF_ICMPNE, assertionPassed); + } + mv.visitJumpInsn(GOTO, assertionFailed); + mv.visitLabel(assertionPassed); + } + } + + private void generateCharSetLookaheadAtCurrentPosition( + MethodVisitor mv, + AssertionCheck assertion, + int posVar, + Label assertionFailed, + LocalVarAllocator allocator) { + if (assertion.isPositive()) { + mv.visitVarInsn(ILOAD, posVar); + pushInt(mv, assertion.offset + assertion.charSets.size()); + mv.visitInsn(IADD); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitJumpInsn(IF_ICMPGT, assertionFailed); + + int checkPosVar = allocator.allocate(); + int chVar = allocator.allocate(); + for (int i = 0; i < assertion.charSets.size(); i++) { + mv.visitVarInsn(ILOAD, posVar); + pushInt(mv, assertion.offset + i); + mv.visitInsn(IADD); + mv.visitVarInsn(ISTORE, checkPosVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, checkPosVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + mv.visitVarInsn(ISTORE, chVar); + generateCharSetCheck(mv, assertion.charSets.get(i), chVar, assertionFailed); + } + } else { + Label assertionPassed = new Label(); + int checkPosVar = allocator.allocate(); + int chVar = allocator.allocate(); + for (int i = 0; i < assertion.charSets.size(); i++) { + mv.visitVarInsn(ILOAD, posVar); + pushInt(mv, assertion.offset + i); + mv.visitInsn(IADD); + mv.visitVarInsn(ISTORE, checkPosVar); + mv.visitVarInsn(ILOAD, checkPosVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitJumpInsn(IF_ICMPGE, assertionPassed); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, checkPosVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + mv.visitVarInsn(ISTORE, chVar); + generateCharSetCheck(mv, assertion.charSets.get(i), chVar, assertionPassed); + } + mv.visitJumpInsn(GOTO, assertionFailed); + mv.visitLabel(assertionPassed); + } + } + + private void generateLookbehindAtCurrentPosition( + MethodVisitor mv, + AssertionCheck assertion, + int posVar, + Label assertionFailed, + LocalVarAllocator allocator) { + int checkPosVar = allocator.allocate(); + mv.visitVarInsn(ILOAD, posVar); + pushInt(mv, assertion.width); + mv.visitInsn(ISUB); + mv.visitVarInsn(ISTORE, checkPosVar); + + mv.visitVarInsn(ILOAD, checkPosVar); + Label boundsOk = new Label(); + Label assertionPassed = new Label(); + mv.visitJumpInsn(IFGE, boundsOk); + if (assertion.isPositive()) { + mv.visitJumpInsn(GOTO, assertionFailed); + } else { + mv.visitJumpInsn(GOTO, assertionPassed); + } + mv.visitLabel(boundsOk); + + if (assertion.isLiteral) { + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, checkPosVar); + mv.visitLdcInsn(assertion.literal); + mv.visitInsn(ICONST_0); + pushInt(mv, assertion.literal.length()); + mv.visitMethodInsn( + INVOKEVIRTUAL, "java/lang/String", "regionMatches", "(ILjava/lang/String;II)Z", false); + if (assertion.isPositive()) { + mv.visitJumpInsn(IFEQ, assertionFailed); + } else { + mv.visitJumpInsn(IFEQ, assertionPassed); + mv.visitJumpInsn(GOTO, assertionFailed); + } + } else { + Label mismatch = new Label(); + int chVar = allocator.allocate(); + for (int i = 0; i < assertion.charSets.size(); i++) { + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, checkPosVar); + if (i > 0) { + pushInt(mv, i); + mv.visitInsn(IADD); + } + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + mv.visitVarInsn(ISTORE, chVar); + generateCharSetCheck(mv, assertion.charSets.get(i), chVar, mismatch); + } + if (!assertion.isPositive()) { + mv.visitJumpInsn(GOTO, assertionFailed); + } + mv.visitJumpInsn(GOTO, assertionPassed); + mv.visitLabel(mismatch); + if (assertion.isPositive()) { + mv.visitJumpInsn(GOTO, assertionFailed); + } + } + mv.visitLabel(assertionPassed); + } + /** * Generates find() method - delegates to findFrom(input, 0). * @@ -709,10 +778,12 @@ public void generateFindFromMethod(ClassWriter cw, String className) { mv.visitVarInsn(ILOAD, lenVar); mv.visitJumpInsn(IF_ICMPGE, outerLoopEnd); - // ANCHOR OPTIMIZATION: Skip positions that can't match due to anchors - // Use requiresStartAnchor (not hasStartAnchor) to handle alternations like (^foo|bar) - // where one branch has anchor but pattern can still match at any position via other branch - if (requiresStartAnchor || hasStringStartAnchor) { + // ANCHOR OPTIMIZATION: Skip positions that can't match due to anchors. + // {@link NFA#requiresStartAnchor()} treats both START (^) and STRING_START (\A) as barriers, + // so it returns true only when ALL paths to a useful target go through one of them. Or-ing + // in {@code hasStringStartAnchor} on top short-circuits on patterns like `]\A|b` where only + // one branch has \A but the other can still match anywhere. + if (requiresStartAnchor) { // Non-multiline ^ or \A: Only try position 0 // if (tryPos != 0) return -1; Label validPosition = new Label(); @@ -925,89 +996,23 @@ public void generateMatchesAtStartMethod(ClassWriter cw) { mv.visitInsn(IRETURN); mv.visitLabel(notNull); - // Check if start state is accepting + // Check if start state is accepting (per-state assertions/anchor conditions gate the + // empty-match path) if (dfa.getStartState().accepting) { - if (hasEndAnchor || hasStringEndAnchor || hasStringEndAbsoluteAnchor || hasMultilineEnd) { - // Must check end anchor before accepting empty match - Label continueMatching = new Label(); - - // Get current position (startPos) and length - mv.visitVarInsn(ILOAD, 2); // startPos - mv.visitVarInsn(ALOAD, 1); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); - - if (hasStringEndAbsoluteAnchor) { - // \z: Accept only if startPos == length - mv.visitJumpInsn(IF_ICMPNE, continueMatching); - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - mv.visitLabel(continueMatching); - } else if (hasStringEndAnchor) { - // \Z: Accept if startPos == length OR (startPos == length-1 AND charAt(startPos) == '\n') - Label accept = new Label(); - - // Stack: startPos, length - mv.visitInsn(DUP2); - mv.visitJumpInsn(IF_ICMPEQ, accept); // If startPos == length, accept - - // Check if startPos == length-1 - mv.visitInsn(DUP2); - mv.visitInsn(SWAP); - mv.visitInsn(ICONST_1); - mv.visitInsn(ISUB); - mv.visitInsn(SWAP); - mv.visitJumpInsn(IF_ICMPNE, continueMatching); // startPos != length-1 - - // Check charAt(startPos) == '\n' - mv.visitInsn(POP); // Remove length - mv.visitVarInsn(ALOAD, 1); - mv.visitInsn(SWAP); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); - pushInt(mv, '\n'); - mv.visitJumpInsn(IF_ICMPNE, continueMatching); - - mv.visitLabel(accept); - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - mv.visitLabel(continueMatching); - } else if (hasEndAnchor) { - // Non-multiline $: Accept only if startPos == length - mv.visitJumpInsn(IF_ICMPNE, continueMatching); - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - mv.visitLabel(continueMatching); - } else { // hasMultilineEnd - // Multiline $: Accept if startPos == length OR charAt(startPos) == '\n' - Label accept = new Label(); - - // Stack: startPos, length - mv.visitInsn(DUP2); - mv.visitJumpInsn(IF_ICMPEQ, accept); // If startPos == length, accept - - // Check if startPos < length && charAt(startPos) == '\n' - mv.visitInsn(DUP2); - mv.visitJumpInsn(IF_ICMPGE, continueMatching); // startPos >= length, can't check charAt - - // Stack: startPos, length - mv.visitInsn(POP); // Stack: startPos - mv.visitVarInsn(ALOAD, 1); - mv.visitInsn(SWAP); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); - pushInt(mv, '\n'); - mv.visitJumpInsn(IF_ICMPNE, continueMatching); - - mv.visitLabel(accept); - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - - mv.visitLabel(continueMatching); - // Anchor not met, continue to main matching loop - } + Label continueMatching = new Label(); + for (AssertionCheck assertion : dfa.getStartState().assertionChecks) { + generateAssertionCheckAtCurrentPosition(mv, assertion, 2, continueMatching, allocator); + } + if (dfa.getStartState().acceptanceAnchorConditions.isEmpty()) { + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); } else { - // No end anchor - accept immediately + emitAcceptanceAnchorChecks( + mv, dfa.getStartState().acceptanceAnchorConditions, 2, continueMatching); mv.visitInsn(ICONST_1); mv.visitInsn(IRETURN); } + mv.visitLabel(continueMatching); } // int state = 0; (start state) @@ -1036,53 +1041,24 @@ public void generateMatchesAtStartMethod(ClassWriter cw) { pushInt(mv, acceptState.id); mv.visitJumpInsn(IF_ICMPNE, checkNext); - // Found accepting state - but must check end anchor if present - if (hasEndAnchor || hasStringEndAnchor || hasStringEndAbsoluteAnchor || hasMultilineEnd) { - Label continueMatching = new Label(); - - // Get current position and length - mv.visitVarInsn(ILOAD, posVar); - mv.visitVarInsn(ALOAD, 1); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); - - if (hasEndAnchor) { - // Non-multiline $: Accept only if pos == length - mv.visitJumpInsn(IF_ICMPNE, continueMatching); // If pos != length, continue matching - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - } else { // hasMultilineEnd - // Multiline $: Accept if pos == length OR charAt(pos) == '\n' - Label accept = new Label(); - - // Stack: pos, length - mv.visitInsn(DUP2); // Stack: pos, length, pos, length - mv.visitJumpInsn(IF_ICMPEQ, accept); // If pos == length, accept - - // Stack: pos, length - // Check if pos < length && charAt(pos) == '\n' - mv.visitInsn(DUP2); // Stack: pos, length, pos, length - mv.visitJumpInsn(IF_ICMPGE, continueMatching); // pos >= length, can't check charAt - - // Stack: pos, length - mv.visitInsn(POP); // Stack: pos - mv.visitVarInsn(ALOAD, 1); - mv.visitInsn(SWAP); // Stack: input, pos - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); - pushInt(mv, '\n'); - mv.visitJumpInsn(IF_ICMPNE, continueMatching); // Not '\n', continue matching - - mv.visitLabel(accept); - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - } - - mv.visitLabel(continueMatching); - // Anchor condition not met, continue to checkNext + // Found accepting state — gate acceptance on per-state assertions/anchor conditions. If a + // lookahead assertion fails, keep consuming via outgoing transitions instead of rejecting the + // whole start position; a later, longer prefix may satisfy the assertion. + Label continueMatching = new Label(); + for (AssertionCheck assertion : acceptState.assertionChecks) { + generateAssertionCheckAtCurrentPosition(mv, assertion, posVar, continueMatching, allocator); + } + if (acceptState.acceptanceAnchorConditions.isEmpty()) { + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); } else { - // No end anchor - accept immediately + emitAcceptanceAnchorChecks( + mv, acceptState.acceptanceAnchorConditions, posVar, continueMatching); mv.visitInsn(ICONST_1); mv.visitInsn(IRETURN); } + mv.visitLabel(continueMatching); + mv.visitJumpInsn(GOTO, notAccepting); mv.visitLabel(checkNext); } @@ -1108,9 +1084,10 @@ public void generateMatchesAtStartMethod(ClassWriter cw) { // Note: pos has been incremented, so pass posVar for assertion checking generateStateSwitch(mv, stateVar, chVar, posVar, loopStart, allocator); - // End of input - check if accepting + // End of input - check if accepting, including zero-width assertions attached to the accepting + // DFA state. mv.visitLabel(loopEnd); - generateAcceptCheck(mv, stateVar); + generateAcceptCheckWithAssertions(mv, stateVar, posVar, allocator); mv.visitMaxs(0, 0); mv.visitEnd(); @@ -1216,6 +1193,64 @@ public void generateMatchMethod(ClassWriter cw, String className) { mv.visitEnd(); } + /** + * Generates matchInto(String, int[], int[]) method. This is the allocation-free equivalent of + * match(String): it writes group 0 and capture group boundaries into caller-provided arrays. + */ + public void generateMatchIntoMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod(ACC_PUBLIC, "matchInto", "(Ljava/lang/String;[I[I)Z", null, null); + mv.visitCode(); + + if (groupCount > 0 && hasGroupActions()) { + generateMatchIntoWithGroupTracking(mv); + mv.visitMaxs(0, 0); + mv.visitEnd(); + return; + } + + // if (!matches(input)) return false; + Label matchSuccess = new Label(); + mv.visitVarInsn(ALOAD, 0); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn( + INVOKEVIRTUAL, className.replace('.', '/'), "matches", "(Ljava/lang/String;)Z", false); + mv.visitJumpInsn(IFNE, matchSuccess); + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + + mv.visitLabel(matchSuccess); + + // groupStarts[0] = 0; groupEnds[0] = input.length(); + mv.visitVarInsn(ALOAD, 2); + mv.visitInsn(ICONST_0); + mv.visitInsn(ICONST_0); + mv.visitInsn(IASTORE); + mv.visitVarInsn(ALOAD, 3); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitInsn(IASTORE); + + // Initialize inner groups to -1 when the DFA has group slots but no group actions. + for (int i = 1; i <= groupCount; i++) { + mv.visitVarInsn(ALOAD, 2); + pushInt(mv, i); + mv.visitInsn(ICONST_M1); + mv.visitInsn(IASTORE); + + mv.visitVarInsn(ALOAD, 3); + pushInt(mv, i); + mv.visitInsn(ICONST_M1); + mv.visitInsn(IASTORE); + } + + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + /** * Generates matchesBounded() method - boolean bounded matching (allocation-free). Uses * switch-based DFA on the bounded region [start, end). @@ -1309,60 +1344,23 @@ public void generateMatchesBoundedMethod(ClassWriter cw, String className) { // End of input - check if in accept state mv.visitLabel(loopEnd); - // Check if state is accepting + // Check if state is accepting (bounded path — per-state conditions, CharSequence helper) for (DFA.DFAState acceptState : dfa.getAcceptStates()) { Label checkNext = new Label(); mv.visitVarInsn(ILOAD, stateVar); pushInt(mv, acceptState.id); mv.visitJumpInsn(IF_ICMPNE, checkNext); - // Found accepting state - but must check end anchor if present - if (hasEndAnchor || hasStringEndAnchor || hasStringEndAbsoluteAnchor || hasMultilineEnd) { - Label continueChecking = new Label(); - - // At loopEnd, pos == end (we've consumed the bounded region) - // For end anchor, check if end satisfies anchor condition - mv.visitVarInsn(ILOAD, posVar); - mv.visitVarInsn(ALOAD, 1); - mv.visitMethodInsn(INVOKEINTERFACE, "java/lang/CharSequence", "length", "()I", true); - - if (hasEndAnchor) { - // Non-multiline $: Accept only if pos == length - mv.visitJumpInsn(IF_ICMPNE, continueChecking); - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - mv.visitLabel(continueChecking); - } else { // hasMultilineEnd - // Multiline $: Accept if pos == length OR charAt(pos) == '\n' - Label accept = new Label(); - - // Stack: pos, length - mv.visitInsn(DUP2); - mv.visitJumpInsn(IF_ICMPEQ, accept); // If pos == length, accept - - // Check if pos < length && charAt(pos) == '\n' - mv.visitInsn(DUP2); - mv.visitJumpInsn(IF_ICMPGE, continueChecking); // pos >= length, can't check charAt - - // Stack: pos, length - mv.visitInsn(POP); // Stack: pos - mv.visitVarInsn(ALOAD, 1); - mv.visitInsn(SWAP); - mv.visitMethodInsn(INVOKEINTERFACE, "java/lang/CharSequence", "charAt", "(I)C", true); - pushInt(mv, '\n'); - mv.visitJumpInsn(IF_ICMPNE, continueChecking); - - mv.visitLabel(accept); - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - - mv.visitLabel(continueChecking); - // Anchor not met, check next state - } + if (acceptState.acceptanceAnchorConditions.isEmpty()) { + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); } else { - // No end anchor - accept immediately + Label continueChecking = new Label(); + emitAcceptanceAnchorChecksCharSequence( + mv, acceptState.acceptanceAnchorConditions, posVar, continueChecking); mv.visitInsn(ICONST_1); mv.visitInsn(IRETURN); + mv.visitLabel(continueChecking); } mv.visitLabel(checkNext); @@ -1543,42 +1541,10 @@ public void generateMatchBoundedMethod(ClassWriter cw, String className) { pushInt(mv, acceptState.id); mv.visitJumpInsn(IF_ICMPNE, checkNext); - // Found accepting state - but must check end anchor if present - if (hasEndAnchor || hasStringEndAnchor || hasStringEndAbsoluteAnchor || hasMultilineEnd) { - Label createResult = new Label(); - - // At loopEnd, pos == end (we've consumed the bounded region) - // For end anchor, check if end satisfies anchor condition - mv.visitVarInsn(ILOAD, posVar); - mv.visitVarInsn(ALOAD, 1); - mv.visitMethodInsn(INVOKEINTERFACE, "java/lang/CharSequence", "length", "()I", true); - - if (hasEndAnchor) { - // Non-multiline $: Accept only if pos == length - mv.visitJumpInsn(IF_ICMPNE, checkNext); // Anchor not met, check next state - // Fall through to createResult - } else { // hasMultilineEnd - // Multiline $: Accept if pos == length OR charAt(pos) == '\n' - Label checkNewline = new Label(); - - // Stack: pos, length - mv.visitInsn(DUP2); - mv.visitJumpInsn(IF_ICMPEQ, createResult); // If pos == length, accept - - // Check if pos < length && charAt(pos) == '\n' - mv.visitInsn(DUP2); - mv.visitJumpInsn(IF_ICMPGE, checkNext); // pos >= length, can't check charAt - - // Stack: pos, length - mv.visitInsn(POP); // Stack: pos - mv.visitVarInsn(ALOAD, 1); - mv.visitInsn(SWAP); - mv.visitMethodInsn(INVOKEINTERFACE, "java/lang/CharSequence", "charAt", "(I)C", true); - pushInt(mv, '\n'); - mv.visitJumpInsn(IF_ICMPNE, checkNext); // Not '\n', check next state - } - - mv.visitLabel(createResult); + // Found accepting state — gate acceptance on per-state anchor conditions (CharSequence path). + if (!acceptState.acceptanceAnchorConditions.isEmpty()) { + emitAcceptanceAnchorChecksCharSequence( + mv, acceptState.acceptanceAnchorConditions, posVar, checkNext); } // Accepting state - create MatchResult @@ -2315,6 +2281,86 @@ private void generateMatchWithGroupTracking(MethodVisitor mv, String className) mv.visitInsn(ARETURN); } + /** Generate matchInto() method body with inline group tracking during DFA execution. */ + private void generateMatchIntoWithGroupTracking(MethodVisitor mv) { + // Slots: 0=this, 1=input, 2=groupStarts, 3=groupEnds + LocalVarAllocator allocator = new LocalVarAllocator(4); + + // if (input == null) return false; + Label notNull = new Label(); + mv.visitVarInsn(ALOAD, 1); + mv.visitJumpInsn(IFNONNULL, notNull); + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + mv.visitLabel(notNull); + + int stateVar = allocator.allocate(); + pushInt(mv, dfa.getStartState().id); + mv.visitVarInsn(ISTORE, stateVar); + + int posVar = allocator.allocate(); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ISTORE, posVar); + + int chVar = allocator.allocate(); + + // Initialize group 0 and inner groups in the caller-provided arrays. + mv.visitVarInsn(ALOAD, 2); + mv.visitInsn(ICONST_0); + mv.visitInsn(ICONST_0); + mv.visitInsn(IASTORE); + + for (int i = 1; i <= groupCount; i++) { + mv.visitVarInsn(ALOAD, 2); + pushInt(mv, i); + mv.visitInsn(ICONST_M1); + mv.visitInsn(IASTORE); + + mv.visitVarInsn(ALOAD, 3); + pushInt(mv, i); + mv.visitInsn(ICONST_M1); + mv.visitInsn(IASTORE); + } + + generateGroupActionsForState(mv, dfa.getStartState(), posVar, 2, 3); + + Label loopStart = new Label(); + Label loopEnd = new Label(); + + mv.visitLabel(loopStart); + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitJumpInsn(IF_ICMPGE, loopEnd); + + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + mv.visitVarInsn(ISTORE, chVar); + mv.visitIincInsn(posVar, 1); + + generateStateSwitchWithGroupTrackingReturningFalse( + mv, stateVar, chVar, posVar, 2, 3, loopStart); + + mv.visitLabel(loopEnd); + + Label isAccept = new Label(); + generateAcceptCheck(mv, stateVar, isAccept); + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + + mv.visitLabel(isAccept); + + mv.visitVarInsn(ALOAD, 3); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitInsn(IASTORE); + + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); + } + /** Generate switch statement with group tracking. */ private void generateStateSwitchWithGroupTracking( MethodVisitor mv, @@ -2357,6 +2403,44 @@ private void generateStateSwitchWithGroupTracking( mv.visitInsn(ARETURN); } + /** Generate switch statement with group tracking for boolean matchInto(). */ + private void generateStateSwitchWithGroupTrackingReturningFalse( + MethodVisitor mv, + int stateVar, + int chVar, + int posVar, + int groupStartsVar, + int groupEndsVar, + Label loopStart) { + Label defaultLabel = new Label(); + Label[] caseLabels = new Label[dfa.getAllStates().size()]; + + for (int i = 0; i < dfa.getAllStates().size(); i++) { + caseLabels[i] = new Label(); + } + + mv.visitVarInsn(ILOAD, stateVar); + mv.visitTableSwitchInsn(0, dfa.getAllStates().size() - 1, defaultLabel, caseLabels); + + for (DFA.DFAState state : dfa.getAllStates()) { + mv.visitLabel(caseLabels[state.id]); + generateStateCaseCodeWithGroupTracking( + mv, + state, + stateVar, + chVar, + posVar, + groupStartsVar, + groupEndsVar, + loopStart, + defaultLabel); + } + + mv.visitLabel(defaultLabel); + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + } + /** Generate case code for a state with group tracking. */ private void generateStateCaseCodeWithGroupTracking( MethodVisitor mv, @@ -2437,4 +2521,156 @@ private void generateAcceptCheck(MethodVisitor mv, int stateVar, Label acceptLab mv.visitJumpInsn(IF_ICMPEQ, acceptLabel); } } + + // ============================================================================================ + // Per-state anchor condition emission. Mirrors the structure in DFAUnrolledBytecodeGenerator: + // each DFAState records the EnumSet of anchor types required at the current input position to + // be considered accepting, and each DFATransition records the anchors that must hold at the + // source position before the transition is taken. + // ============================================================================================ + + /** + * Emit a check for one anchor type at {@code posVar}. When {@code charSequence} is true the + * length/charAt calls go through {@link CharSequence} (bounded path); otherwise they target + * {@link String} (unbounded path). + */ + private void emitSingleAnchorCheck( + MethodVisitor mv, NFA.AnchorType anchor, int posVar, Label failed, boolean charSequence) { + String owner = charSequence ? "java/lang/CharSequence" : "java/lang/String"; + int invoke = charSequence ? INVOKEINTERFACE : INVOKEVIRTUAL; + boolean isIface = charSequence; + switch (anchor) { + case STRING_END_ABSOLUTE: + // \z — strict end only + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(invoke, owner, "length", "()I", isIface); + mv.visitJumpInsn(IF_ICMPNE, failed); + break; + case START: + case STRING_START: + mv.visitVarInsn(ILOAD, posVar); + mv.visitJumpInsn(IFNE, failed); + break; + case END: + // $ (non-multiline) matches at end OR before final '\n' — same as \Z. + // Fall through to STRING_END. + case STRING_END: + { + // OK iff pos == end OR (pos == end - 1 AND charAt(pos) == '\n') + Label ok = new Label(); + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(invoke, owner, "length", "()I", isIface); + mv.visitJumpInsn(IF_ICMPEQ, ok); + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(invoke, owner, "length", "()I", isIface); + mv.visitInsn(ICONST_1); + mv.visitInsn(ISUB); + mv.visitJumpInsn(IF_ICMPNE, failed); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + mv.visitMethodInsn(invoke, owner, "charAt", "(I)C", isIface); + pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPNE, failed); + mv.visitLabel(ok); + break; + } + case END_MULTILINE: + { + Label ok = new Label(); + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(invoke, owner, "length", "()I", isIface); + mv.visitJumpInsn(IF_ICMPEQ, ok); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + mv.visitMethodInsn(invoke, owner, "charAt", "(I)C", isIface); + pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPNE, failed); + mv.visitLabel(ok); + break; + } + case START_MULTILINE: + { + Label ok = new Label(); + mv.visitVarInsn(ILOAD, posVar); + mv.visitJumpInsn(IFEQ, ok); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_1); + mv.visitInsn(ISUB); + mv.visitMethodInsn(invoke, owner, "charAt", "(I)C", isIface); + pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPNE, failed); + mv.visitLabel(ok); + break; + } + default: + break; + } + } + + /** Emit per-state acceptance condition checks (String path). */ + private void emitAcceptanceAnchorChecks( + MethodVisitor mv, EnumSet conditions, int posVar, Label failed) { + for (NFA.AnchorType anchor : conditions) { + emitSingleAnchorCheck(mv, anchor, posVar, failed, false); + } + } + + /** Emit per-state acceptance condition checks (CharSequence/bounded path). */ + private void emitAcceptanceAnchorChecksCharSequence( + MethodVisitor mv, EnumSet conditions, int posVar, Label failed) { + for (NFA.AnchorType anchor : conditions) { + emitSingleAnchorCheck(mv, anchor, posVar, failed, true); + } + } + + /** + * Emit a transition entry-guard check. {@code posVar} is the position AFTER the char was consumed + * (i.e. source pos + 1). END-class anchors in the guard are treated as dead (the transition is + * skipped) since SubsetConstructor should have pruned them at construction. + */ + private void emitTransitionEntryGuard( + MethodVisitor mv, EnumSet entryGuard, int posVar, Label skipTransition) { + for (NFA.AnchorType anchor : entryGuard) { + switch (anchor) { + case START: + case STRING_START: + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_1); + mv.visitJumpInsn(IF_ICMPNE, skipTransition); + break; + case START_MULTILINE: + { + Label ok = new Label(); + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_1); + mv.visitJumpInsn(IF_ICMPEQ, ok); + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_2); + mv.visitJumpInsn(IF_ICMPLT, skipTransition); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_2); + mv.visitInsn(ISUB); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPNE, skipTransition); + mv.visitLabel(ok); + break; + } + case END: + case STRING_END_ABSOLUTE: + case STRING_END: + case END_MULTILINE: + mv.visitJumpInsn(GOTO, skipTransition); + break; + default: + break; + } + } + } } diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFATableBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFATableBytecodeGenerator.java new file mode 100644 index 0000000..add885d --- /dev/null +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFATableBytecodeGenerator.java @@ -0,0 +1,519 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.codegen.codegen; + +import static com.datadoghq.reggie.codegen.codegen.BytecodeUtil.pushInt; +import static org.objectweb.asm.Opcodes.*; + +import com.datadoghq.reggie.codegen.automaton.DFA; +import com.datadoghq.reggie.codegen.automaton.DFATableData; +import org.objectweb.asm.ClassWriter; +import org.objectweb.asm.Label; +import org.objectweb.asm.MethodVisitor; + +/** Generates compact table-driven DFA bytecode for large pure regular patterns without groups. */ +public final class DFATableBytecodeGenerator { + private static final String RUNTIME = "com/datadoghq/reggie/runtime/DFATableRuntime"; + private static final int STRING_CHUNK_CHARS = 10_000; + + private final DFATableData table; + + public DFATableBytecodeGenerator(DFA dfa) { + this.table = DFATableData.from(dfa); + } + + public void generateStaticData(ClassWriter cw, String className) { + cw.visitField(ACC_PRIVATE | ACC_STATIC | ACC_FINAL, "DFA_TRANSITIONS", "[I", null, null) + .visitEnd(); + cw.visitField(ACC_PRIVATE | ACC_STATIC | ACC_FINAL, "DFA_ACCEPTING", "[Z", null, null) + .visitEnd(); + cw.visitField(ACC_PRIVATE | ACC_STATIC | ACC_FINAL, "DFA_ASCII_CLASSES", "[I", null, null) + .visitEnd(); + cw.visitField(ACC_PRIVATE | ACC_STATIC | ACC_FINAL, "DFA_START_ASCII", "[Z", null, null) + .visitEnd(); + cw.visitField(ACC_PRIVATE | ACC_STATIC | ACC_FINAL, "DFA_RANGE_STARTS", "[C", null, null) + .visitEnd(); + cw.visitField(ACC_PRIVATE | ACC_STATIC | ACC_FINAL, "DFA_RANGE_ENDS", "[C", null, null) + .visitEnd(); + cw.visitField(ACC_PRIVATE | ACC_STATIC | ACC_FINAL, "DFA_RANGE_CLASSES", "[I", null, null) + .visitEnd(); + + MethodVisitor mv = cw.visitMethod(ACC_STATIC, "", "()V", null, null); + mv.visitCode(); + + pushStringArray(mv, encodeRle(table.transitions)); + pushInt(mv, table.transitions.length); + mv.visitMethodInsn( + INVOKESTATIC, RUNTIME, "decodeRleIntArray", "([Ljava/lang/String;I)[I", false); + mv.visitFieldInsn(PUTSTATIC, className, "DFA_TRANSITIONS", "[I"); + + pushStringArray(mv, encodeBooleans(table.accepting)); + pushInt(mv, table.accepting.length); + mv.visitMethodInsn( + INVOKESTATIC, RUNTIME, "decodeBooleanArray", "([Ljava/lang/String;I)[Z", false); + mv.visitFieldInsn(PUTSTATIC, className, "DFA_ACCEPTING", "[Z"); + + pushStringArray(mv, encodeRle(table.asciiClasses)); + pushInt(mv, table.asciiClasses.length); + mv.visitMethodInsn( + INVOKESTATIC, RUNTIME, "decodeRleIntArray", "([Ljava/lang/String;I)[I", false); + mv.visitFieldInsn(PUTSTATIC, className, "DFA_ASCII_CLASSES", "[I"); + + pushStringArray(mv, encodeBooleans(table.startAscii)); + pushInt(mv, table.startAscii.length); + mv.visitMethodInsn( + INVOKESTATIC, RUNTIME, "decodeBooleanArray", "([Ljava/lang/String;I)[Z", false); + mv.visitFieldInsn(PUTSTATIC, className, "DFA_START_ASCII", "[Z"); + + pushStringArray(mv, split(new String(table.rangeStarts))); + pushInt(mv, table.rangeStarts.length); + mv.visitMethodInsn(INVOKESTATIC, RUNTIME, "decodeCharArray", "([Ljava/lang/String;I)[C", false); + mv.visitFieldInsn(PUTSTATIC, className, "DFA_RANGE_STARTS", "[C"); + + pushStringArray(mv, split(new String(table.rangeEnds))); + pushInt(mv, table.rangeEnds.length); + mv.visitMethodInsn(INVOKESTATIC, RUNTIME, "decodeCharArray", "([Ljava/lang/String;I)[C", false); + mv.visitFieldInsn(PUTSTATIC, className, "DFA_RANGE_ENDS", "[C"); + + pushStringArray(mv, encodeRle(table.rangeClasses)); + pushInt(mv, table.rangeClasses.length); + mv.visitMethodInsn( + INVOKESTATIC, RUNTIME, "decodeRleIntArray", "([Ljava/lang/String;I)[I", false); + mv.visitFieldInsn(PUTSTATIC, className, "DFA_RANGE_CLASSES", "[I"); + + mv.visitInsn(RETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateMatchesMethod(ClassWriter cw, String className) { + MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "matches", "(Ljava/lang/String;)Z", null, null); + mv.visitCode(); + pushTableCallArguments(mv, className, 1); + mv.visitMethodInsn( + INVOKESTATIC, + RUNTIME, + "matches", + tableCallDescriptor("Ljava/lang/CharSequence;", ")Z"), + false); + mv.visitInsn(IRETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateFindMethod(ClassWriter cw, String className) { + MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "find", "(Ljava/lang/String;)Z", null, null); + mv.visitCode(); + mv.visitVarInsn(ALOAD, 0); + mv.visitVarInsn(ALOAD, 1); + mv.visitInsn(ICONST_0); + mv.visitMethodInsn(INVOKEVIRTUAL, className, "findFrom", "(Ljava/lang/String;I)I", false); + Label notFound = new Label(); + mv.visitJumpInsn(IFLT, notFound); + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); + mv.visitLabel(notFound); + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateFindFromMethod(ClassWriter cw, String className) { + MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "findFrom", "(Ljava/lang/String;I)I", null, null); + mv.visitCode(); + pushTableCallArguments(mv, className, 1, 2); + mv.visitMethodInsn( + INVOKESTATIC, + RUNTIME, + "findFrom", + tableCallDescriptor("Ljava/lang/CharSequence;I", ")I"), + false); + mv.visitInsn(IRETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateMatchMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod( + ACC_PUBLIC, + "match", + "(Ljava/lang/String;)Lcom/datadoghq/reggie/runtime/MatchResult;", + null, + null); + mv.visitCode(); + mv.visitVarInsn(ALOAD, 0); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, className, "matches", "(Ljava/lang/String;)Z", false); + Label matched = new Label(); + mv.visitJumpInsn(IFNE, matched); + mv.visitInsn(ACONST_NULL); + mv.visitInsn(ARETURN); + mv.visitLabel(matched); + newMatchResult(mv, 1, 0, -1); + mv.visitInsn(ARETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateMatchIntoMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod(ACC_PUBLIC, "matchInto", "(Ljava/lang/String;[I[I)Z", null, null); + mv.visitCode(); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn( + INVOKESTATIC, + "java/util/Objects", + "requireNonNull", + "(Ljava/lang/Object;)Ljava/lang/Object;", + false); + mv.visitInsn(POP); + mv.visitVarInsn(ALOAD, 2); + mv.visitMethodInsn( + INVOKESTATIC, + "java/util/Objects", + "requireNonNull", + "(Ljava/lang/Object;)Ljava/lang/Object;", + false); + mv.visitInsn(POP); + mv.visitVarInsn(ALOAD, 3); + mv.visitMethodInsn( + INVOKESTATIC, + "java/util/Objects", + "requireNonNull", + "(Ljava/lang/Object;)Ljava/lang/Object;", + false); + mv.visitInsn(POP); + mv.visitVarInsn(ALOAD, 2); + mv.visitInsn(ARRAYLENGTH); + Label startsOk = new Label(); + mv.visitJumpInsn(IFGT, startsOk); + throwBounds(mv); + mv.visitLabel(startsOk); + mv.visitVarInsn(ALOAD, 3); + mv.visitInsn(ARRAYLENGTH); + Label endsOk = new Label(); + mv.visitJumpInsn(IFGT, endsOk); + throwBounds(mv); + mv.visitLabel(endsOk); + mv.visitVarInsn(ALOAD, 0); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, className, "matches", "(Ljava/lang/String;)Z", false); + Label success = new Label(); + mv.visitJumpInsn(IFNE, success); + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + mv.visitLabel(success); + mv.visitVarInsn(ALOAD, 2); + mv.visitInsn(ICONST_0); + mv.visitInsn(ICONST_0); + mv.visitInsn(IASTORE); + mv.visitVarInsn(ALOAD, 3); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitInsn(IASTORE); + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateMatchesBoundedMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod(ACC_PUBLIC, "matchesBounded", "(Ljava/lang/CharSequence;II)Z", null, null); + mv.visitCode(); + pushTableCallArguments(mv, className, 1, 2, 3); + mv.visitMethodInsn( + INVOKESTATIC, + RUNTIME, + "matchesBounded", + tableCallDescriptor("Ljava/lang/CharSequence;II", ")Z"), + false); + mv.visitInsn(IRETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateMatchBoundedMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod( + ACC_PUBLIC, + "matchBounded", + "(Ljava/lang/CharSequence;II)Lcom/datadoghq/reggie/runtime/MatchResult;", + null, + null); + mv.visitCode(); + mv.visitVarInsn(ALOAD, 0); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, 2); + mv.visitVarInsn(ILOAD, 3); + mv.visitMethodInsn( + INVOKEVIRTUAL, className, "matchesBounded", "(Ljava/lang/CharSequence;II)Z", false); + Label matched = new Label(); + mv.visitJumpInsn(IFNE, matched); + mv.visitInsn(ACONST_NULL); + mv.visitInsn(ARETURN); + mv.visitLabel(matched); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn( + INVOKEINTERFACE, "java/lang/CharSequence", "toString", "()Ljava/lang/String;", true); + int stringVar = 4; + mv.visitVarInsn(ASTORE, stringVar); + newMatchResult(mv, stringVar, 2, 3); + mv.visitInsn(ARETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateFindMatchMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod( + ACC_PUBLIC, + "findMatch", + "(Ljava/lang/String;)Lcom/datadoghq/reggie/runtime/MatchResult;", + null, + null); + mv.visitCode(); + mv.visitVarInsn(ALOAD, 0); + mv.visitVarInsn(ALOAD, 1); + mv.visitInsn(ICONST_0); + mv.visitMethodInsn( + INVOKEVIRTUAL, + className, + "findMatchFrom", + "(Ljava/lang/String;I)Lcom/datadoghq/reggie/runtime/MatchResult;", + false); + mv.visitInsn(ARETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateFindMatchFromMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod( + ACC_PUBLIC, + "findMatchFrom", + "(Ljava/lang/String;I)Lcom/datadoghq/reggie/runtime/MatchResult;", + null, + null); + mv.visitCode(); + pushInt(mv, 2); + mv.visitIntInsn(NEWARRAY, T_INT); + int boundsVar = 3; + mv.visitVarInsn(ASTORE, boundsVar); + mv.visitVarInsn(ALOAD, 0); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, 2); + mv.visitVarInsn(ALOAD, boundsVar); + mv.visitMethodInsn( + INVOKEVIRTUAL, className, "findBoundsFrom", "(Ljava/lang/String;I[I)Z", false); + Label matched = new Label(); + mv.visitJumpInsn(IFNE, matched); + mv.visitInsn(ACONST_NULL); + mv.visitInsn(ARETURN); + mv.visitLabel(matched); + mv.visitTypeInsn(NEW, "com/datadoghq/reggie/runtime/MatchResultImpl"); + mv.visitInsn(DUP); + mv.visitVarInsn(ALOAD, 1); + pushInt(mv, 1); + mv.visitIntInsn(NEWARRAY, T_INT); + mv.visitInsn(DUP); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ALOAD, boundsVar); + mv.visitInsn(ICONST_0); + mv.visitInsn(IALOAD); + mv.visitInsn(IASTORE); + pushInt(mv, 1); + mv.visitIntInsn(NEWARRAY, T_INT); + mv.visitInsn(DUP); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ALOAD, boundsVar); + mv.visitInsn(ICONST_1); + mv.visitInsn(IALOAD); + mv.visitInsn(IASTORE); + mv.visitInsn(ICONST_0); + mv.visitMethodInsn( + INVOKESPECIAL, + "com/datadoghq/reggie/runtime/MatchResultImpl", + "", + "(Ljava/lang/String;[I[II)V", + false); + mv.visitInsn(ARETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + public void generateFindBoundsFromMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod(ACC_PUBLIC, "findBoundsFrom", "(Ljava/lang/String;I[I)Z", null, null); + mv.visitCode(); + pushFindBoundsTableCallArguments(mv, className, 1, 2, 3); + mv.visitMethodInsn( + INVOKESTATIC, + RUNTIME, + "findBoundsFrom", + tableCallDescriptor("Ljava/lang/CharSequence;I[I", ")Z"), + false); + mv.visitInsn(IRETURN); + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + private void pushTableCallArguments(MethodVisitor mv, String className, int inputVar) { + mv.visitVarInsn(ALOAD, inputVar); + pushCommonTableArguments(mv, className); + } + + private void pushTableCallArguments( + MethodVisitor mv, String className, int inputVar, int intVar) { + mv.visitVarInsn(ALOAD, inputVar); + mv.visitVarInsn(ILOAD, intVar); + pushCommonTableArguments(mv, className); + } + + private void pushTableCallArguments( + MethodVisitor mv, String className, int inputVar, int intVar1, int intVar2) { + mv.visitVarInsn(ALOAD, inputVar); + mv.visitVarInsn(ILOAD, intVar1); + mv.visitVarInsn(ILOAD, intVar2); + pushCommonTableArguments(mv, className); + } + + private void pushFindBoundsTableCallArguments( + MethodVisitor mv, String className, int inputVar, int intVar, int arrayVar) { + mv.visitVarInsn(ALOAD, inputVar); + mv.visitVarInsn(ILOAD, intVar); + mv.visitVarInsn(ALOAD, arrayVar); + pushCommonTableArguments(mv, className); + } + + private void pushCommonTableArguments(MethodVisitor mv, String className) { + pushInt(mv, table.startState); + pushInt(mv, table.classCount); + mv.visitFieldInsn(GETSTATIC, className, "DFA_TRANSITIONS", "[I"); + mv.visitFieldInsn(GETSTATIC, className, "DFA_ACCEPTING", "[Z"); + mv.visitFieldInsn(GETSTATIC, className, "DFA_ASCII_CLASSES", "[I"); + mv.visitFieldInsn(GETSTATIC, className, "DFA_START_ASCII", "[Z"); + mv.visitFieldInsn(GETSTATIC, className, "DFA_RANGE_STARTS", "[C"); + mv.visitFieldInsn(GETSTATIC, className, "DFA_RANGE_ENDS", "[C"); + mv.visitFieldInsn(GETSTATIC, className, "DFA_RANGE_CLASSES", "[I"); + } + + private static String tableCallDescriptor(String prefix, String suffix) { + return "(" + prefix + "II[I[Z[I[Z[C[C[I" + suffix; + } + + private static void newMatchResult( + MethodVisitor mv, int inputVar, int startVarOrConst, int endVar) { + mv.visitTypeInsn(NEW, "com/datadoghq/reggie/runtime/MatchResultImpl"); + mv.visitInsn(DUP); + mv.visitVarInsn(ALOAD, inputVar); + pushInt(mv, 1); + mv.visitIntInsn(NEWARRAY, T_INT); + mv.visitInsn(DUP); + mv.visitInsn(ICONST_0); + if (startVarOrConst == 0) { + mv.visitInsn(ICONST_0); + } else { + mv.visitVarInsn(ILOAD, startVarOrConst); + } + mv.visitInsn(IASTORE); + pushInt(mv, 1); + mv.visitIntInsn(NEWARRAY, T_INT); + mv.visitInsn(DUP); + mv.visitInsn(ICONST_0); + if (endVar == -1) { + mv.visitVarInsn(ALOAD, inputVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + } else { + mv.visitVarInsn(ILOAD, endVar); + } + mv.visitInsn(IASTORE); + mv.visitInsn(ICONST_0); + mv.visitMethodInsn( + INVOKESPECIAL, + "com/datadoghq/reggie/runtime/MatchResultImpl", + "", + "(Ljava/lang/String;[I[II)V", + false); + } + + private static void throwBounds(MethodVisitor mv) { + mv.visitTypeInsn(NEW, "java/lang/IndexOutOfBoundsException"); + mv.visitInsn(DUP); + mv.visitLdcInsn("group arrays must have length at least 1 for this pattern"); + mv.visitMethodInsn( + INVOKESPECIAL, + "java/lang/IndexOutOfBoundsException", + "", + "(Ljava/lang/String;)V", + false); + mv.visitInsn(ATHROW); + } + + private static String[] encodeRle(int[] values) { + StringBuilder encoded = new StringBuilder(); + int index = 0; + while (index < values.length) { + int value = values[index]; + int count = 1; + while (index + count < values.length && values[index + count] == value) { + count++; + } + appendInt(encoded, value); + appendInt(encoded, count); + index += count; + } + return split(encoded.toString()); + } + + private static String[] encodeBooleans(boolean[] values) { + StringBuilder encoded = new StringBuilder(values.length); + for (boolean value : values) { + encoded.append(value ? (char) 1 : (char) 0); + } + return split(encoded.toString()); + } + + private static void appendInt(StringBuilder builder, int value) { + builder.append((char) (value >>> 16)); + builder.append((char) value); + } + + private static String[] split(String value) { + int chunkCount = Math.max(1, (value.length() + STRING_CHUNK_CHARS - 1) / STRING_CHUNK_CHARS); + String[] chunks = new String[chunkCount]; + for (int i = 0; i < chunkCount; i++) { + int start = i * STRING_CHUNK_CHARS; + int end = Math.min(value.length(), start + STRING_CHUNK_CHARS); + chunks[i] = value.substring(start, end); + } + return chunks; + } + + private static void pushStringArray(MethodVisitor mv, String[] chunks) { + pushInt(mv, chunks.length); + mv.visitTypeInsn(ANEWARRAY, "java/lang/String"); + for (int i = 0; i < chunks.length; i++) { + mv.visitInsn(DUP); + pushInt(mv, i); + mv.visitLdcInsn(chunks[i]); + mv.visitInsn(AASTORE); + } + } +} diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFAUnrolledBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFAUnrolledBytecodeGenerator.java index f79716b..69f20e1 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFAUnrolledBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/DFAUnrolledBytecodeGenerator.java @@ -22,6 +22,7 @@ import com.datadoghq.reggie.codegen.automaton.CharSet; import com.datadoghq.reggie.codegen.automaton.DFA; import com.datadoghq.reggie.codegen.automaton.NFA; +import java.util.EnumSet; import java.util.HashMap; import java.util.HashSet; import java.util.Map; @@ -275,8 +276,10 @@ private void generateStateCode( Label endOfInput = new Label(); Label assertionFailed = new Label(); - // PROTOTYPE: Generate assertion checks first (before consuming character) - if (!state.assertionChecks.isEmpty()) { + // Assertions on non-accepting states are transition guards. Assertions on accepting states gate + // only acceptance; if they fail before the end of input, outgoing transitions must still be + // tried so a longer match can satisfy the final assertion. + if (!state.accepting && !state.assertionChecks.isEmpty()) { for (AssertionCheck assertion : state.assertionChecks) { generateAssertionCheck(mv, assertion, posVar, assertionFailed, allocator); } @@ -288,34 +291,6 @@ private void generateStateCode( mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); mv.visitJumpInsn(IF_ICMPGE, endOfInput); - // Special check for \Z (STRING_END): if accepting and pos == length-1 and charAt(pos) == '\n', - // accept - if (state.accepting && hasStringEndAnchor) { - // if (pos == input.length() - 1 && input.charAt(pos) == '\n') return true; - Label notStringEnd = new Label(); - - // Check if pos == length - 1 - mv.visitVarInsn(ILOAD, posVar); - mv.visitVarInsn(ALOAD, 1); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); - mv.visitInsn(ICONST_1); - mv.visitInsn(ISUB); - mv.visitJumpInsn(IF_ICMPNE, notStringEnd); - - // Check if charAt(pos) == '\n' - mv.visitVarInsn(ALOAD, 1); - mv.visitVarInsn(ILOAD, posVar); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); - pushInt(mv, '\n'); - mv.visitJumpInsn(IF_ICMPNE, notStringEnd); - - // Both conditions met - accept - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - - mv.visitLabel(notStringEnd); - } - // char ch = input.charAt(pos); int chVar = allocator.allocate(); mv.visitVarInsn(ALOAD, 1); @@ -329,10 +304,14 @@ private void generateStateCode( // Generate transition checks (NO BitSet, direct char comparisons) for (Map.Entry entry : state.transitions.entrySet()) { CharSet chars = entry.getKey(); - DFA.DFAState target = entry.getValue().target; + DFA.DFATransition trans = entry.getValue(); + DFA.DFAState target = trans.target; Label nextCheck = new Label(); generateCharSetCheck(mv, chars, chVar, nextCheck); + if (!trans.entryGuard.isEmpty()) { + emitTransitionEntryGuard(mv, trans.entryGuard, posVar, nextCheck); + } // Match found - jump to target state mv.visitJumpInsn(GOTO, stateLabels.get(target)); @@ -345,10 +324,24 @@ private void generateStateCode( mv.visitInsn(ICONST_0); mv.visitInsn(IRETURN); - // Handle end of input + // Handle end of input (with per-state assertion/acceptance condition gating) mv.visitLabel(endOfInput); if (state.accepting) { - mv.visitInsn(ICONST_1); + for (AssertionCheck assertion : state.assertionChecks) { + generateAssertionCheck(mv, assertion, posVar, assertionFailed, allocator); + } + if (state.acceptanceAnchorConditions.isEmpty()) { + mv.visitInsn(ICONST_1); + } else { + Label rejectAtEnd = new Label(); + Label endAccepted = new Label(); + emitAcceptanceAnchorChecks(mv, state.acceptanceAnchorConditions, posVar, rejectAtEnd); + mv.visitInsn(ICONST_1); + mv.visitJumpInsn(GOTO, endAccepted); + mv.visitLabel(rejectAtEnd); + mv.visitInsn(ICONST_0); + mv.visitLabel(endAccepted); + } } else { mv.visitInsn(ICONST_0); } @@ -456,14 +449,7 @@ private void generateAssertionCheck( } } else { // Character class sequence assertion (e.g., [A-Z][0-9]) - // Optimization: Single bounds check for entire sequence - // if (pos + offset + length > input.length()) fail - mv.visitVarInsn(ILOAD, posVar); - pushInt(mv, assertion.offset + assertion.charSets.size()); - mv.visitInsn(IADD); - mv.visitVarInsn(ALOAD, 1); // input - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); - mv.visitJumpInsn(IF_ICMPGT, assertionFailed); + Label assertionPassed = new Label(); // Allocate temp vars int checkPosVar = allocator.allocate(); @@ -473,24 +459,30 @@ private void generateAssertionCheck( CharSet charSet = assertion.charSets.get(i); int peekOffset = assertion.offset + i; - // Cache offset calculation - // int checkPos = pos + peekOffset; + // Cache offset calculation: int checkPos = pos + peekOffset; mv.visitVarInsn(ILOAD, posVar); pushInt(mv, peekOffset); mv.visitInsn(IADD); mv.visitVarInsn(ISTORE, checkPosVar); - // Character check: if (!charSet.contains(input.charAt(checkPos))) fail + // Bounds check. + mv.visitVarInsn(ILOAD, checkPosVar); + mv.visitVarInsn(ALOAD, 1); // input + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitJumpInsn(IF_ICMPGE, assertion.isPositive() ? assertionFailed : assertionPassed); + mv.visitVarInsn(ALOAD, 1); // input mv.visitVarInsn(ILOAD, checkPosVar); mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); mv.visitVarInsn(ISTORE, chVar); - // Check if character matches charset - Label matches = new Label(); - generateCharSetCheck(mv, charSet, chVar, matches); - mv.visitJumpInsn(GOTO, assertionFailed); // Doesn't match - mv.visitLabel(matches); // Matches - continue + generateCharSetCheck( + mv, charSet, chVar, assertion.isPositive() ? assertionFailed : assertionPassed); + } + + if (!assertion.isPositive()) { + mv.visitJumpInsn(GOTO, assertionFailed); + mv.visitLabel(assertionPassed); } } } else if (assertion.isLookbehind()) { @@ -773,10 +765,12 @@ public void generateFindFromMethod(ClassWriter cw, String className) { mv.visitVarInsn(ILOAD, 3); mv.visitJumpInsn(IF_ICMPGE, outerLoopEnd); - // ANCHOR OPTIMIZATION: Skip positions that can't match due to anchors - // Use requiresStartAnchor (not hasStartAnchor) to handle alternations like (^foo|bar) - // where one branch has anchor but pattern can still match at any position via other branch - if (requiresStartAnchor || hasStringStartAnchor) { + // ANCHOR OPTIMIZATION: Skip positions that can't match due to anchors. + // {@link NFA#requiresStartAnchor()} already treats both START (^) and STRING_START (\A) as + // barriers, so it returns true only when ALL paths to a useful target go through one of + // them. Or-ing in {@code hasStringStartAnchor} on top short-circuits on patterns like + // `]\A|b` where only one branch has \A but the other can still match anywhere. + if (requiresStartAnchor) { // Non-multiline ^ or \A: Only try position 0 // if (tryPos != 0) return -1; Label validPosition = new Label(); @@ -1012,23 +1006,23 @@ public void generateMatchesAtStartMethod(ClassWriter cw) { // Check if start state is accepting BUT also check assertions first! if (dfa.getStartState().accepting) { + Label conditionFailed = new Label(); // Must check assertions before accepting empty match if (!dfa.getStartState().assertionChecks.isEmpty()) { - Label assertionFailed = new Label(); for (AssertionCheck assertion : dfa.getStartState().assertionChecks) { - generateAssertionCheck(mv, assertion, posVar, assertionFailed, allocator); + generateAssertionCheck(mv, assertion, posVar, conditionFailed, allocator); } - // Assertions passed - empty match is valid - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - - // Assertions failed - continue to try non-empty match - mv.visitLabel(assertionFailed); - } else { - // No assertions - empty match is valid (patterns like a*) - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); } + // Per-state anchor conditions: bare $ et al. accept only when conditions hold at pos. + if (!dfa.getStartState().acceptanceAnchorConditions.isEmpty()) { + emitAcceptanceAnchorChecks( + mv, dfa.getStartState().acceptanceAnchorConditions, posVar, conditionFailed); + } + // Conditions and assertions passed - empty match is valid + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); + // Failed - continue to try non-empty match + mv.visitLabel(conditionFailed); } // Create labels for all states @@ -1061,108 +1055,27 @@ private void generateMatchAtStartStateCode( Label endOfInput = new Label(); Label assertionFailed = new Label(); - // CRITICAL: Check assertions FIRST, before accepting! - // Bug was here: previously returned true immediately for accepting states, - // skipping assertion checks. Pattern a(?=bc) would match 'a' without checking (?=bc). - if (!state.assertionChecks.isEmpty()) { + // For accepting states, assertions gate acceptance only. If they fail, the DFA must still try + // outgoing transitions: e.g. (?<=\\[)[^\\]]+(?=\\]) should not reject after seeing just "v" + // in "[value]" because the trailing lookahead fails there; it should keep consuming until ']'. + if (state.accepting && state != dfa.getStartState()) { + Label continueMatching = new Label(); for (AssertionCheck assertion : state.assertionChecks) { - generateAssertionCheck(mv, assertion, posVar, assertionFailed, allocator); + generateAssertionCheck(mv, assertion, posVar, continueMatching, allocator); } - } - - // After assertions pass, check if this is an accepting state - // If so, return true (we've matched successfully) - // BUT: For patterns with end anchors, we must also check anchor conditions - if (state.accepting && state != dfa.getStartState()) { - if (hasEndAnchor || hasStringEndAnchor || hasStringEndAbsoluteAnchor || hasMultilineEnd) { - // End anchor present - must check position before accepting - Label continueMatching = new Label(); - - // Allocate temp vars for anchor check - int savedPosVar = allocator.allocate(); - int lenVar = allocator.allocate(); - - // Get current position and length for anchor check - mv.visitVarInsn(ILOAD, posVar); - mv.visitVarInsn(ISTORE, savedPosVar); // Save pos temporarily - - mv.visitVarInsn(ALOAD, 1); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); - mv.visitVarInsn(ISTORE, lenVar); // Save length temporarily - - if (hasStringEndAbsoluteAnchor) { - // \z: Accept only if pos == length (absolute end) - mv.visitVarInsn(ILOAD, savedPosVar); - mv.visitVarInsn(ILOAD, lenVar); - mv.visitJumpInsn(IF_ICMPNE, continueMatching); - - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - } else if (hasStringEndAnchor) { - // \Z: Accept if pos == length OR (pos == length-1 AND charAt(pos) == '\n') - Label accept = new Label(); - - // if (pos == length) accept - mv.visitVarInsn(ILOAD, savedPosVar); - mv.visitVarInsn(ILOAD, lenVar); - mv.visitJumpInsn(IF_ICMPEQ, accept); - - // if (pos == length-1 && charAt(pos) == '\n') accept - mv.visitVarInsn(ILOAD, savedPosVar); - mv.visitVarInsn(ILOAD, lenVar); - mv.visitInsn(ICONST_1); - mv.visitInsn(ISUB); - mv.visitJumpInsn(IF_ICMPNE, continueMatching); // pos != length-1 - - mv.visitVarInsn(ALOAD, 1); - mv.visitVarInsn(ILOAD, savedPosVar); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); - pushInt(mv, '\n'); - mv.visitJumpInsn(IF_ICMPNE, continueMatching); // Not '\n' - - mv.visitLabel(accept); - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - } else if (hasEndAnchor) { - // Non-multiline $: Accept only if pos == length - mv.visitVarInsn(ILOAD, savedPosVar); - mv.visitVarInsn(ILOAD, lenVar); - mv.visitJumpInsn(IF_ICMPNE, continueMatching); - - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - } else { // hasMultilineEnd - // Multiline $: Accept if pos == length OR charAt(pos) == '\n' - Label accept = new Label(); - - // if (pos == length) accept - mv.visitVarInsn(ILOAD, savedPosVar); - mv.visitVarInsn(ILOAD, lenVar); - mv.visitJumpInsn(IF_ICMPEQ, accept); - - // if (pos < length && charAt(pos) == '\n') accept - mv.visitVarInsn(ILOAD, savedPosVar); - mv.visitVarInsn(ILOAD, lenVar); - mv.visitJumpInsn(IF_ICMPGE, continueMatching); // pos >= length - - mv.visitVarInsn(ALOAD, 1); - mv.visitVarInsn(ILOAD, savedPosVar); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); - pushInt(mv, '\n'); - mv.visitJumpInsn(IF_ICMPNE, continueMatching); - - mv.visitLabel(accept); - mv.visitInsn(ICONST_1); - mv.visitInsn(IRETURN); - } - - // Anchor condition not met, continue matching - mv.visitLabel(continueMatching); + if (state.acceptanceAnchorConditions.isEmpty()) { + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); } else { - // No end anchor - accept immediately + emitAcceptanceAnchorChecks(mv, state.acceptanceAnchorConditions, posVar, continueMatching); mv.visitInsn(ICONST_1); mv.visitInsn(IRETURN); } + mv.visitLabel(continueMatching); + } else if (!state.assertionChecks.isEmpty()) { + for (AssertionCheck assertion : state.assertionChecks) { + generateAssertionCheck(mv, assertion, posVar, assertionFailed, allocator); + } } // if (pos >= input.length()) goto endOfInput @@ -1184,10 +1097,14 @@ private void generateMatchAtStartStateCode( // Check transitions for (Map.Entry entry : state.transitions.entrySet()) { CharSet chars = entry.getKey(); - DFA.DFAState target = entry.getValue().target; + DFA.DFATransition trans = entry.getValue(); + DFA.DFAState target = trans.target; Label nextCheck = new Label(); generateCharSetCheck(mv, chars, chVar, nextCheck); + if (!trans.entryGuard.isEmpty()) { + emitTransitionEntryGuard(mv, trans.entryGuard, posVar, nextCheck); + } mv.visitJumpInsn(GOTO, stateLabels.get(target)); mv.visitLabel(nextCheck); } @@ -1196,10 +1113,23 @@ private void generateMatchAtStartStateCode( mv.visitInsn(ICONST_0); mv.visitInsn(IRETURN); - // End of input - check if accepting + // End of input - check if accepting (respecting per-state anchor conditions). Accepting states + // with assertions were already evaluated before the transition block; if those assertions + // failed and we got here, the match must fail rather than accepting at end-of-input. mv.visitLabel(endOfInput); - if (state.accepting) { - mv.visitInsn(ICONST_1); + if (state.accepting && state.assertionChecks.isEmpty()) { + if (state.acceptanceAnchorConditions.isEmpty()) { + mv.visitInsn(ICONST_1); + } else { + Label rejectAtEnd = new Label(); + Label endAccepted = new Label(); + emitAcceptanceAnchorChecks(mv, state.acceptanceAnchorConditions, posVar, rejectAtEnd); + mv.visitInsn(ICONST_1); + mv.visitJumpInsn(GOTO, endAccepted); + mv.visitLabel(rejectAtEnd); + mv.visitInsn(ICONST_0); + mv.visitLabel(endAccepted); + } } else { mv.visitInsn(ICONST_0); } @@ -1585,17 +1515,27 @@ private void generateTaggedDFAMatching( for (DFA.DFAState state : dfa.getAllStates()) { mv.visitLabel(stateLabels.get(state)); - // If this is an accepting state, save current position and tags + // If this is an accepting state, save current position and tags (gated on anchor + // conditions — e.g. a state accepting only at ^ must not record mid-input positions). if (state.accepting) { - // longestPos = pos; - mv.visitVarInsn(ILOAD, posVar); - mv.visitVarInsn(ISTORE, longestPosVar); - - // savedTags = tags.clone(); - mv.visitVarInsn(ALOAD, tagsVar); - mv.visitMethodInsn(INVOKEVIRTUAL, "[I", "clone", "()Ljava/lang/Object;", false); - mv.visitTypeInsn(CHECKCAST, "[I"); - mv.visitVarInsn(ASTORE, savedTagsVar); + if (state.acceptanceAnchorConditions.isEmpty()) { + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ISTORE, longestPosVar); + mv.visitVarInsn(ALOAD, tagsVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "[I", "clone", "()Ljava/lang/Object;", false); + mv.visitTypeInsn(CHECKCAST, "[I"); + mv.visitVarInsn(ASTORE, savedTagsVar); + } else { + Label skipSave = new Label(); + emitAcceptanceAnchorChecks(mv, state.acceptanceAnchorConditions, posVar, skipSave); + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ISTORE, longestPosVar); + mv.visitVarInsn(ALOAD, tagsVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "[I", "clone", "()Ljava/lang/Object;", false); + mv.visitTypeInsn(CHECKCAST, "[I"); + mv.visitVarInsn(ASTORE, savedTagsVar); + mv.visitLabel(skipSave); + } } // Check if we've reached end of input @@ -1965,10 +1905,18 @@ private void generateGreedyStateCode( } // If this is an accepting state, record current position (after assertions pass). + // Per-state acceptance conditions must be satisfied before recording the position. if (state.accepting) { - // longestMatchEnd = pos; - mv.visitVarInsn(ILOAD, posVar); - mv.visitVarInsn(ISTORE, longestMatchEndVar); + if (state.acceptanceAnchorConditions.isEmpty()) { + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ISTORE, longestMatchEndVar); + } else { + Label skipRecord = new Label(); + emitAcceptanceAnchorChecks(mv, state.acceptanceAnchorConditions, posVar, skipRecord); + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ISTORE, longestMatchEndVar); + mv.visitLabel(skipRecord); + } } // if (pos >= input.length()) goto endOfInput @@ -2269,13 +2217,18 @@ private void generateBoundedStateCode( // pos++; mv.visitIincInsn(posVar, 1); - // Generate transition checks + // Generate transition checks (bounded path uses CharSequence) + InputAccess boundedAccess = charSequenceInputAccess(mv, endVar); for (Map.Entry entry : state.transitions.entrySet()) { CharSet chars = entry.getKey(); - DFA.DFAState target = entry.getValue().target; + DFA.DFATransition trans = entry.getValue(); + DFA.DFAState target = trans.target; Label nextCheck = new Label(); generateCharSetCheck(mv, chars, 5, nextCheck); + if (!trans.entryGuard.isEmpty()) { + emitTransitionEntryGuard(mv, trans.entryGuard, posVar, nextCheck, boundedAccess); + } // Match found - jump to target state mv.visitJumpInsn(GOTO, stateLabels.get(target)); @@ -2288,10 +2241,22 @@ private void generateBoundedStateCode( mv.visitInsn(ICONST_0); mv.visitInsn(IRETURN); - // Handle end of region + // Handle end of region (with per-state acceptance condition gating) mv.visitLabel(endOfRegion); if (state.accepting) { - mv.visitInsn(ICONST_1); + if (state.acceptanceAnchorConditions.isEmpty()) { + mv.visitInsn(ICONST_1); + } else { + Label rejectAtEnd = new Label(); + Label endAccepted = new Label(); + emitAcceptanceAnchorChecksBounded( + mv, state.acceptanceAnchorConditions, posVar, endVar, rejectAtEnd); + mv.visitInsn(ICONST_1); + mv.visitJumpInsn(GOTO, endAccepted); + mv.visitLabel(rejectAtEnd); + mv.visitInsn(ICONST_0); + mv.visitLabel(endAccepted); + } } else { mv.visitInsn(ICONST_0); } @@ -3258,4 +3223,210 @@ private void generateAssertionCheckWithGroups( generateAssertionGroupCapture(mv, assertion, posVar, groupStartsVar, groupEndsVar); } } + + // ============================================================================================ + // Per-state anchor condition emission (replaces the legacy global hasEndAnchor / requiresStart + // checks at accept sites). Each accepting DFAState carries an EnumSet of anchor types that + // must hold at the current input position before acceptance is granted; each DFATransition + // similarly carries an EnumSet of START-class anchor types that must hold at the *source* + // position before the transition is taken. + // ============================================================================================ + + /** + * Strategy bundle for emitting input-shape-specific bytecode (String vs CharSequence). Stays + * local to anchor-check helpers so the rest of the generator is unaffected. + */ + private static final class InputAccess { + /** Emit {@code input.length()} on the stack. */ + final Runnable loadLength; + + /** Emit {@code input.charAt()}. The int operand is consumed. */ + final Runnable invokeCharAt; + + InputAccess(Runnable loadLength, Runnable invokeCharAt) { + this.loadLength = loadLength; + this.invokeCharAt = invokeCharAt; + } + } + + /** Access for the String-typed input held in local slot 1 (unbounded path). */ + private InputAccess stringInputAccess(MethodVisitor mv) { + return new InputAccess( + () -> { + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + }, + () -> mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false)); + } + + /** Access for the CharSequence-typed input held in local slot 1 (bounded path). */ + private InputAccess charSequenceInputAccess(MethodVisitor mv, int endVar) { + return new InputAccess( + () -> mv.visitVarInsn(ILOAD, endVar), + () -> + mv.visitMethodInsn(INVOKEINTERFACE, "java/lang/CharSequence", "charAt", "(I)C", true)); + } + + /** + * Emit a check for one anchor type at {@code posVar} against the end position; jump to {@code + * failed} if the anchor's positional precondition does not hold. + */ + private void emitSingleAnchorCheck( + MethodVisitor mv, NFA.AnchorType anchor, int posVar, Label failed, InputAccess access) { + switch (anchor) { + case STRING_END_ABSOLUTE: + // \z — strict end: if (pos != end) goto failed + mv.visitVarInsn(ILOAD, posVar); + access.loadLength.run(); + mv.visitJumpInsn(IF_ICMPNE, failed); + break; + case START: + case STRING_START: + // if (pos != 0) goto failed + mv.visitVarInsn(ILOAD, posVar); + mv.visitJumpInsn(IFNE, failed); + break; + case END: + // $ (non-multiline) — same semantics as \Z: matches at end OR before final '\n'. + // Java's $ is NOT strict: Pattern.compile("x$").matcher("x\n").find() == true. + // Fall through to STRING_END. + case STRING_END: + { + // OK iff pos == end OR (pos == end - 1 AND charAt(pos) == '\n') + Label ok = new Label(); + mv.visitVarInsn(ILOAD, posVar); + access.loadLength.run(); + mv.visitJumpInsn(IF_ICMPEQ, ok); + mv.visitVarInsn(ILOAD, posVar); + access.loadLength.run(); + mv.visitInsn(ICONST_1); + mv.visitInsn(ISUB); + mv.visitJumpInsn(IF_ICMPNE, failed); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + access.invokeCharAt.run(); + pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPNE, failed); + mv.visitLabel(ok); + break; + } + case END_MULTILINE: + { + // OK iff pos == end OR charAt(pos) == '\n' + Label ok = new Label(); + mv.visitVarInsn(ILOAD, posVar); + access.loadLength.run(); + mv.visitJumpInsn(IF_ICMPEQ, ok); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + access.invokeCharAt.run(); + pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPNE, failed); + mv.visitLabel(ok); + break; + } + case START_MULTILINE: + { + // OK iff pos == 0 OR charAt(pos - 1) == '\n' + Label ok = new Label(); + mv.visitVarInsn(ILOAD, posVar); + mv.visitJumpInsn(IFEQ, ok); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_1); + mv.visitInsn(ISUB); + access.invokeCharAt.run(); + pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPNE, failed); + mv.visitLabel(ok); + break; + } + default: + break; + } + } + + /** Emit acceptance-condition checks against {@code input.length()} (String path). */ + private void emitAcceptanceAnchorChecks( + MethodVisitor mv, EnumSet conditions, int posVar, Label failed) { + InputAccess access = stringInputAccess(mv); + for (NFA.AnchorType anchor : conditions) { + emitSingleAnchorCheck(mv, anchor, posVar, failed, access); + } + } + + /** + * Emit acceptance-condition checks against a precomputed end length held in {@code endVar} + * (CharSequence path). + */ + private void emitAcceptanceAnchorChecksBounded( + MethodVisitor mv, EnumSet conditions, int posVar, int endVar, Label failed) { + InputAccess access = charSequenceInputAccess(mv, endVar); + for (NFA.AnchorType anchor : conditions) { + emitSingleAnchorCheck(mv, anchor, posVar, failed, access); + } + } + + /** Emit a transition entry-guard check (String input path). */ + private void emitTransitionEntryGuard( + MethodVisitor mv, EnumSet entryGuard, int posVar, Label skipTransition) { + emitTransitionEntryGuard(mv, entryGuard, posVar, skipTransition, stringInputAccess(mv)); + } + + /** + * Emit a transition entry-guard check. {@code posVar} is the position AFTER the char was consumed + * (i.e. source pos + 1). If any START-class anchor required by the guard does not hold at the + * source position, jump to {@code skipTransition}. {@code access} selects the + * input-shape-specific {@code charAt} call kind (String virtual vs CharSequence interface). + */ + private void emitTransitionEntryGuard( + MethodVisitor mv, + EnumSet entryGuard, + int posVar, + Label skipTransition, + InputAccess access) { + for (NFA.AnchorType anchor : entryGuard) { + switch (anchor) { + case START: + case STRING_START: + // source_pos == 0 iff pos == 1 (after pre-transition increment) + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_1); + mv.visitJumpInsn(IF_ICMPNE, skipTransition); + break; + case START_MULTILINE: + { + // source_pos == 0 OR charAt(source_pos - 1) == '\n'. + // After the consume-increment, source_pos = pos - 1, so we test charAt(pos - 2). + Label ok = new Label(); + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_1); + mv.visitJumpInsn(IF_ICMPEQ, ok); // pos == 1 → source_pos == 0 + // Need pos >= 2 to look at charAt(pos - 2). + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_2); + mv.visitJumpInsn(IF_ICMPLT, skipTransition); + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + mv.visitInsn(ICONST_2); + mv.visitInsn(ISUB); + access.invokeCharAt.run(); + pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPNE, skipTransition); + mv.visitLabel(ok); + break; + } + case END: + case STRING_END_ABSOLUTE: + case STRING_END: + case END_MULTILINE: + // END-class anchors are pruned at construction; if one slipped through, kill the + // transition unconditionally. + mv.visitJumpInsn(GOTO, skipTransition); + break; + default: + break; + } + } + } } diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/GreedyCharClassBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/GreedyCharClassBytecodeGenerator.java index 0aa9d72..412affc 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/GreedyCharClassBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/GreedyCharClassBytecodeGenerator.java @@ -332,6 +332,13 @@ public void generateFindFromMethod(ClassWriter cw, String className) { mv.visitLabel(checksPass); + // For min=0 (like * or {0,}), any position is a valid match start — empty match always valid. + // Return start immediately so the caller's greedy scan determines the actual end. + if (minMatches == 0) { + mv.visitVarInsn(ILOAD, startVar); + mv.visitInsn(IRETURN); + } + // int len = input.length(); mv.visitVarInsn(ALOAD, inputVar); mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/LinearPatternBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/LinearPatternBytecodeGenerator.java index f4eefba..8a6f540 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/LinearPatternBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/LinearPatternBytecodeGenerator.java @@ -622,11 +622,25 @@ private void generateAnchorCheck( mv.visitJumpInsn(IFNE, ctx.failLabel); break; case END: - // if (pos != len) fail; - mv.visitVarInsn(ILOAD, ctx.posVar); - mv.visitVarInsn(ILOAD, ctx.lenVar); - mv.visitJumpInsn(IF_ICMPNE, ctx.failLabel); - break; + { + // $ (non-multiline): same as \Z — pos == len OR (pos == len-1 AND charAt(pos) == '\n') + Label endPassLabel = new Label(); + mv.visitVarInsn(ILOAD, ctx.posVar); + mv.visitVarInsn(ILOAD, ctx.lenVar); + mv.visitJumpInsn(IF_ICMPEQ, endPassLabel); + mv.visitVarInsn(ILOAD, ctx.posVar); + mv.visitVarInsn(ILOAD, ctx.lenVar); + mv.visitInsn(ICONST_1); + mv.visitInsn(ISUB); + mv.visitJumpInsn(IF_ICMPNE, ctx.failLabel); + mv.visitVarInsn(ALOAD, ctx.inputVar); + mv.visitVarInsn(ILOAD, ctx.posVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + BytecodeUtil.pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPNE, ctx.failLabel); + mv.visitLabel(endPassLabel); + break; + } case START_MULTILINE: // if (pos != 0 && (pos == 0 || input.charAt(pos-1) != '\n')) fail; // Simplified: if (pos == 0 || input.charAt(pos-1) == '\n') pass; else fail; diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/NFABytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/NFABytecodeGenerator.java index a04059e..90faf28 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/NFABytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/NFABytecodeGenerator.java @@ -2140,14 +2140,16 @@ private void generateEpsilonClosure( pushInt(mv, '\n'); mv.visitJumpInsn(IF_ICMPNE, worklistLoop); break; - case END: case STRING_END_ABSOLUTE: - // pos == input.length() + // \z: strict pos == input.length() mv.visitVarInsn(ILOAD, posVar); mv.visitVarInsn(ALOAD, inputVar); mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); mv.visitJumpInsn(IF_ICMPNE, worklistLoop); break; + case END: + // $ (non-multiline): same as \Z — pos == length OR before final '\n'. + // Fall through to STRING_END. case STRING_END: // \Z: pos == length || (pos == length-1 && charAt(pos) == '\n') mv.visitVarInsn(ILOAD, posVar); @@ -6288,6 +6290,352 @@ public void generateMatchMethod(ClassWriter cw, String className) { mv.visitEnd(); } + /** + * Generate matchInto() method that writes capture boundaries to caller-provided arrays without + * allocating MatchResultImpl or per-call group arrays. + */ + public void generateMatchIntoMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod(ACC_PUBLIC, "matchInto", "(Ljava/lang/String;[I[I)Z", null, null); + mv.visitCode(); + + // Method signature: matchInto(String input, int[] outStarts, int[] outEnds) + // Slots: 0=this, 1=input, 2=outStarts, 3=outEnds, 4+=locals. + LocalVariableAllocator allocator = new LocalVariableAllocator(4); + int groupCount = nfa.getGroupCount(); + int requiredGroups = groupCount + 1; + + // Objects.requireNonNull(input/outStarts/outEnds) + mv.visitVarInsn(ALOAD, 1); + mv.visitLdcInsn("input"); + mv.visitMethodInsn( + INVOKESTATIC, + "java/util/Objects", + "requireNonNull", + "(Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object;", + false); + mv.visitInsn(POP); + mv.visitVarInsn(ALOAD, 2); + mv.visitLdcInsn("groupStarts"); + mv.visitMethodInsn( + INVOKESTATIC, + "java/util/Objects", + "requireNonNull", + "(Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object;", + false); + mv.visitInsn(POP); + mv.visitVarInsn(ALOAD, 3); + mv.visitLdcInsn("groupEnds"); + mv.visitMethodInsn( + INVOKESTATIC, + "java/util/Objects", + "requireNonNull", + "(Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object;", + false); + mv.visitInsn(POP); + + Label startsLengthOk = new Label(); + mv.visitVarInsn(ALOAD, 2); + mv.visitInsn(ARRAYLENGTH); + pushInt(mv, requiredGroups); + mv.visitJumpInsn(IF_ICMPGE, startsLengthOk); + generateGroupArrayTooSmallThrow(mv, requiredGroups); + mv.visitLabel(startsLengthOk); + + Label endsLengthOk = new Label(); + mv.visitVarInsn(ALOAD, 3); + mv.visitInsn(ARRAYLENGTH); + pushInt(mv, requiredGroups); + mv.visitJumpInsn(IF_ICMPGE, endsLengthOk); + generateGroupArrayTooSmallThrow(mv, requiredGroups); + mv.visitLabel(endsLengthOk); + + int groupStartsVar = allocator.allocateRef(); + int groupEndsVar = allocator.allocateRef(); + int currentStatesVar; + int nextStatesVar; + if (useSingleLong) { + currentStatesVar = allocator.allocateLong(); + nextStatesVar = allocator.allocateLong(); + } else if (useDualLong) { + currentStatesVar = allocateDualLongStateSet(allocator); + nextStatesVar = allocateDualLongStateSet(allocator); + } else { + currentStatesVar = allocator.allocateRef(); + nextStatesVar = allocator.allocateRef(); + } + int posVar = allocator.allocateInt(); + int lenVar = allocator.allocateInt(); + + int worklistVar = allocator.allocateRef(); + int stateIdVar = allocator.allocateInt(); + int worklistSizeVar = allocator.allocateInt(); + int processedVar; + if (useSingleLong) { + processedVar = allocator.allocateLong(); + } else if (useDualLong) { + processedVar = allocateDualLongStateSet(allocator); + } else { + processedVar = allocator.allocateRef(); + } + int indexVar = allocator.allocateInt(); + int sizeVar = allocator.allocateInt(); + int parentIdVar = allocator.allocateInt(); + EpsilonClosureSlots epsilonSlots = + new EpsilonClosureSlots( + worklistVar, stateIdVar, worklistSizeVar, processedVar, indexVar, sizeVar, parentIdVar); + + // Reuse scratch capture arrays from ReggieMatcher instead of allocating per call. + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn(GETFIELD, "com/datadoghq/reggie/runtime/ReggieMatcher", "groupStarts", "[I"); + mv.visitVarInsn(ASTORE, groupStartsVar); + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn(GETFIELD, "com/datadoghq/reggie/runtime/ReggieMatcher", "groupEnds", "[I"); + mv.visitVarInsn(ASTORE, groupEndsVar); + + // Reuse epsilon worklist. + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn( + GETFIELD, "com/datadoghq/reggie/runtime/ReggieMatcher", "epsilonWorklist", "[I"); + mv.visitVarInsn(ASTORE, worklistVar); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ISTORE, stateIdVar); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ISTORE, worklistSizeVar); + + if (useSingleLong || useDualLong) { + initStateSet(mv, processedVar); + } else { + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn( + GETFIELD, + "com/datadoghq/reggie/runtime/ReggieMatcher", + "epsilonProcessed", + "Lcom/datadoghq/reggie/runtime/StateSet;"); + mv.visitVarInsn(ASTORE, processedVar); + mv.visitVarInsn(ALOAD, processedVar); + mv.visitMethodInsn( + INVOKEVIRTUAL, "com/datadoghq/reggie/runtime/StateSet", "clear", "()V", false); + } + + // Initialize scratch group positions to -1. Caller arrays are left untouched until success. + for (int i = 0; i <= groupCount; i++) { + mv.visitVarInsn(ALOAD, groupStartsVar); + pushInt(mv, i); + mv.visitInsn(ICONST_M1); + mv.visitInsn(IASTORE); + mv.visitVarInsn(ALOAD, groupEndsVar); + pushInt(mv, i); + mv.visitInsn(ICONST_M1); + mv.visitInsn(IASTORE); + } + + int configGroupStartsVar = -1, configGroupEndsVar = -1, parentStateMapVar = -1; + if (usePosixLastMatch) { + configGroupStartsVar = allocator.allocateRef(); + configGroupEndsVar = allocator.allocateRef(); + parentStateMapVar = allocator.allocateRef(); + + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn( + GETFIELD, "com/datadoghq/reggie/runtime/ReggieMatcher", "configGroupStarts", "[[I"); + mv.visitVarInsn(ASTORE, configGroupStartsVar); + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn( + GETFIELD, "com/datadoghq/reggie/runtime/ReggieMatcher", "configGroupEnds", "[[I"); + mv.visitVarInsn(ASTORE, configGroupEndsVar); + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn( + GETFIELD, "com/datadoghq/reggie/runtime/ReggieMatcher", "parentStateMap", "[I"); + mv.visitVarInsn(ASTORE, parentStateMapVar); + + generateInitializeConfigArrays( + mv, + configGroupStartsVar, + configGroupEndsVar, + parentStateMapVar, + stateCount, + groupCount, + allocator); + } + + // Set group 0 start = 0. + mv.visitVarInsn(ALOAD, groupStartsVar); + mv.visitInsn(ICONST_0); + mv.visitInsn(ICONST_0); + mv.visitInsn(IASTORE); + + if (useSingleLong || useDualLong) { + initStateSet(mv, currentStatesVar); + initStateSet(mv, nextStatesVar); + } else { + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn( + GETFIELD, + "com/datadoghq/reggie/runtime/ReggieMatcher", + "currentStates", + "Lcom/datadoghq/reggie/runtime/StateSet;"); + mv.visitVarInsn(ASTORE, currentStatesVar); + mv.visitVarInsn(ALOAD, currentStatesVar); + mv.visitMethodInsn( + INVOKEVIRTUAL, "com/datadoghq/reggie/runtime/StateSet", "clear", "()V", false); + + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn( + GETFIELD, + "com/datadoghq/reggie/runtime/ReggieMatcher", + "nextStates", + "Lcom/datadoghq/reggie/runtime/StateSet;"); + mv.visitVarInsn(ASTORE, nextStatesVar); + mv.visitVarInsn(ALOAD, nextStatesVar); + mv.visitMethodInsn( + INVOKEVIRTUAL, "com/datadoghq/reggie/runtime/StateSet", "clear", "()V", false); + } + + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ISTORE, posVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitVarInsn(ISTORE, lenVar); + + addStateToSet(mv, currentStatesVar, nfa.getStartState().id, allocator); + generateEpsilonClosureWithGroups( + mv, + currentStatesVar, + 1, + posVar, + groupStartsVar, + groupEndsVar, + configGroupStartsVar, + configGroupEndsVar, + parentStateMapVar, + allocator, + epsilonSlots); + + int chVar = allocator.allocateInt(); + Label loopStart = new Label(); + Label loopEnd = new Label(); + + mv.visitLabel(loopStart); + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ILOAD, lenVar); + mv.visitJumpInsn(IF_ICMPGE, loopEnd); + + mv.visitVarInsn(ALOAD, 1); + mv.visitVarInsn(ILOAD, posVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + mv.visitVarInsn(ISTORE, chVar); + mv.visitIincInsn(posVar, 1); + + clearStateSet(mv, nextStatesVar); + generateNFAStep( + mv, + currentStatesVar, + nextStatesVar, + chVar, + configGroupStartsVar, + configGroupEndsVar, + groupCount, + allocator); + generateEpsilonClosureWithGroups( + mv, + nextStatesVar, + 1, + posVar, + groupStartsVar, + groupEndsVar, + configGroupStartsVar, + configGroupEndsVar, + parentStateMapVar, + allocator, + epsilonSlots); + swapStateSets(mv, currentStatesVar, nextStatesVar, allocator); + + mv.visitJumpInsn(GOTO, loopStart); + mv.visitLabel(loopEnd); + + for (NFA.NFAState acceptState : nfa.getAcceptStates()) { + checkStateInSetConst(mv, currentStatesVar, acceptState.id, allocator); + Label notThisAccept = new Label(); + mv.visitJumpInsn(IFEQ, notThisAccept); + + mv.visitVarInsn(ALOAD, groupEndsVar); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ILOAD, lenVar); + mv.visitInsn(IASTORE); + + if (usePosixLastMatch && configGroupStartsVar >= 0) { + for (int g = 1; g <= groupCount; g++) { + mv.visitVarInsn(ALOAD, groupStartsVar); + pushInt(mv, g); + mv.visitVarInsn(ALOAD, configGroupStartsVar); + pushInt(mv, acceptState.id); + mv.visitInsn(AALOAD); + pushInt(mv, g); + mv.visitInsn(IALOAD); + mv.visitInsn(IASTORE); + + mv.visitVarInsn(ALOAD, groupEndsVar); + pushInt(mv, g); + mv.visitVarInsn(ALOAD, configGroupEndsVar); + pushInt(mv, acceptState.id); + mv.visitInsn(AALOAD); + pushInt(mv, g); + mv.visitInsn(IALOAD); + mv.visitInsn(IASTORE); + } + } + + // Copy scratch arrays to caller arrays only after success. + mv.visitVarInsn(ALOAD, groupStartsVar); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ALOAD, 2); + mv.visitInsn(ICONST_0); + pushInt(mv, requiredGroups); + mv.visitMethodInsn( + INVOKESTATIC, + "java/lang/System", + "arraycopy", + "(Ljava/lang/Object;ILjava/lang/Object;II)V", + false); + mv.visitVarInsn(ALOAD, groupEndsVar); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ALOAD, 3); + mv.visitInsn(ICONST_0); + pushInt(mv, requiredGroups); + mv.visitMethodInsn( + INVOKESTATIC, + "java/lang/System", + "arraycopy", + "(Ljava/lang/Object;ILjava/lang/Object;II)V", + false); + + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); + mv.visitLabel(notThisAccept); + } + + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + + mv.visitMaxs(0, 0); + mv.visitEnd(); + } + + private void generateGroupArrayTooSmallThrow(MethodVisitor mv, int requiredGroups) { + mv.visitTypeInsn(NEW, "java/lang/IndexOutOfBoundsException"); + mv.visitInsn(DUP); + mv.visitLdcInsn( + "group arrays must have length at least " + requiredGroups + " for this pattern"); + mv.visitMethodInsn( + INVOKESPECIAL, + "java/lang/IndexOutOfBoundsException", + "", + "(Ljava/lang/String;)V", + false); + mv.visitInsn(ATHROW); + } + /** * Generate matchBounded() method that matches a substring range without allocation. * @@ -7079,13 +7427,15 @@ else if (state.assertionType != null) { pushInt(mv, '\n'); mv.visitJumpInsn(IF_ICMPNE, worklistLoop); break; - case END: case STRING_END_ABSOLUTE: + // \z: strict pos == input.length() mv.visitVarInsn(ILOAD, posVar); mv.visitVarInsn(ALOAD, inputVar); mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); mv.visitJumpInsn(IF_ICMPNE, worklistLoop); break; + case END: + // $ (non-multiline): same as \Z. Fall through to STRING_END. case STRING_END: // \Z: pos == length || (pos == length-1 && charAt(pos) == '\n') mv.visitVarInsn(ILOAD, posVar); @@ -7307,16 +7657,18 @@ public void generateFindMatchMethod(ClassWriter cw, String className) { * int matchStart = findFrom(input, start); * if (matchStart < 0) return null; * - * // Try all possible match lengths to find the longest (greedy) - * int longestEnd = matchStart; - * for (int matchEnd = matchStart + 1; matchEnd <= input.length(); matchEnd++) { + * // Try all possible match lengths to find the longest (greedy). + * // longestEnd starts at matchStart - 1 as a "no match yet" sentinel. + * // matchEnd starts at matchStart to include zero-width matches. + * int longestEnd = matchStart - 1; + * for (int matchEnd = matchStart; matchEnd <= input.length(); matchEnd++) { * MatchResult candidate = matchBounded(input, matchStart, matchEnd); * if (candidate != null) { * longestEnd = matchEnd; * } * } * - * if (longestEnd == matchStart) return null; // No valid match + * if (longestEnd < matchStart) return null; // No valid match * * // Return final result with group information * return matchBounded(input, matchStart, longestEnd); @@ -7357,14 +7709,14 @@ public void generateFindMatchFromMethod(ClassWriter cw, String className) { mv.visitLabel(found); // Find longest match end by trying all lengths using matchBounded (zero allocations) - // int matchEnd = matchStart + 1; + // int matchEnd = matchStart; — start at matchStart to include zero-width matches mv.visitVarInsn(ILOAD, 3); - mv.visitInsn(ICONST_1); - mv.visitInsn(IADD); mv.visitVarInsn(ISTORE, 4); // matchEnd (current try) - // int longestEnd = matchStart; (no successful match yet) + // int longestEnd = matchStart - 1; — sentinel: no successful match yet mv.visitVarInsn(ILOAD, 3); + mv.visitInsn(ICONST_1); + mv.visitInsn(ISUB); mv.visitVarInsn(ISTORE, 5); // longestEnd Label loopStart = new Label(); @@ -7404,11 +7756,11 @@ public void generateFindMatchFromMethod(ClassWriter cw, String className) { mv.visitLabel(loopEnd); - // if (longestEnd == matchStart) return null; (no match found) + // if (longestEnd < matchStart) return null; (no match found) Label hasMatch = new Label(); mv.visitVarInsn(ILOAD, 5); mv.visitVarInsn(ILOAD, 3); - mv.visitJumpInsn(IF_ICMPNE, hasMatch); + mv.visitJumpInsn(IF_ICMPGE, hasMatch); mv.visitInsn(ACONST_NULL); mv.visitInsn(ARETURN); mv.visitLabel(hasMatch); diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OnePassBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OnePassBytecodeGenerator.java index da4ccfe..a22090a 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OnePassBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OnePassBytecodeGenerator.java @@ -497,21 +497,35 @@ private void generateAnchorCheck( break; case END: - // if (pos == input.length()) pass; else fail; - mv.visitVarInsn(ILOAD, posVar); - mv.visitVarInsn(ALOAD, inputVar); - mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); - mv.visitJumpInsn(IF_ICMPEQ, passLabel); - // Anchor failed - return false/null - if (returnBoolean) { - mv.visitInsn(ICONST_0); - mv.visitInsn(IRETURN); - } else { - mv.visitInsn(ACONST_NULL); - mv.visitInsn(ARETURN); + { + // $ (non-multiline): same as \Z — matches at end OR before final '\n'. + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ALOAD, inputVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitJumpInsn(IF_ICMPEQ, passLabel); + Label endCheckNewline = new Label(); + mv.visitVarInsn(ILOAD, posVar); + mv.visitVarInsn(ALOAD, inputVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitInsn(ICONST_1); + mv.visitInsn(ISUB); + mv.visitJumpInsn(IF_ICMPNE, endCheckNewline); + mv.visitVarInsn(ALOAD, inputVar); + mv.visitVarInsn(ILOAD, posVar); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + pushInt(mv, '\n'); + mv.visitJumpInsn(IF_ICMPEQ, passLabel); + mv.visitLabel(endCheckNewline); + if (returnBoolean) { + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + } else { + mv.visitInsn(ACONST_NULL); + mv.visitInsn(ARETURN); + } + mv.visitLabel(passLabel); + break; } - mv.visitLabel(passLabel); - break; case START_MULTILINE: // if (pos == 0 || input.charAt(pos-1) == '\n') pass; else fail; diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OptionalGroupBackrefBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OptionalGroupBackrefBytecodeGenerator.java index c2a6ed5..d1ebe92 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OptionalGroupBackrefBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OptionalGroupBackrefBytecodeGenerator.java @@ -1710,7 +1710,10 @@ public void generateFindFromMethod(ClassWriter cw) { mv.visitJumpInsn(GOTO, backrefEnd); mv.visitLabel(groupNotMatched); - // Group captured empty - backref matches empty (always satisfied) + // Group did NOT participate (optional group was skipped entirely). + // Java semantics: \N to a non-participating group FAILS — it does not match empty. + // We must try the next start position instead of vacuously succeeding. + mv.visitJumpInsn(GOTO, tryNextStart); mv.visitLabel(backrefEnd); } diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/QuantifiedGroupBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/QuantifiedGroupBytecodeGenerator.java index 3ad972f..7f93258 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/QuantifiedGroupBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/QuantifiedGroupBytecodeGenerator.java @@ -402,10 +402,10 @@ private void generateCharCheck(MethodVisitor mv, Label exitLabel, LocalVarAlloca mv.visitVarInsn(ISTORE, charVar); generateCharSetCheck(mv, info.charSet, charVar, exitLabel, false); } else { - // Char class + // Char class — pass the negation flag so [^x] groups are checked correctly int charVar = allocator.allocate(); mv.visitVarInsn(ISTORE, charVar); - generateCharSetCheck(mv, info.charSet, charVar, exitLabel, false); + generateCharSetCheck(mv, info.charSet, charVar, exitLabel, info.isNegatedCharSet()); } } diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/RecursiveDescentBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/RecursiveDescentBytecodeGenerator.java index 2acab01..91a8f30 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/RecursiveDescentBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/RecursiveDescentBytecodeGenerator.java @@ -648,6 +648,11 @@ public byte[] generate(String className) { * generating the public API methods. */ public void generateAllParserMethods(ClassWriter cw, String className) { + // Flag set to true while executing find(); false (default) during matches(). + // Lazy quantifiers use this to return the minimum match for find() vs the + // greedy (full-coverage) match for matches(). + cw.visitField(ACC_PRIVATE, "lazyFindMode", "Z", null, null).visitEnd(); + // IMPORTANT: Generate parser methods for AST nodes FIRST // This must happen before generateParseRootMethod, because parseRoot // calls getMethodNameForNode(ast) which adds ast to the map, @@ -751,6 +756,11 @@ public void generateFindBoundsFromMethod(ClassWriter cw, String className) { ACC_PROTECTED, "findBoundsFrom", "(Ljava/lang/CharSequence;I[I)I", null, null); mv.visitCode(); + // lazyFindMode = true: lazy quantifiers return minimum matches during find() + mv.visitVarInsn(ALOAD, 0); + mv.visitInsn(ICONST_1); + mv.visitFieldInsn(PUTFIELD, className, "lazyFindMode", "Z"); + // Local vars: 0=this, 1=charSeq, 2=fromIndex, 3=bounds LocalVarAllocator allocator = new LocalVarAllocator(4); int stringInputVar = allocator.allocate(); @@ -920,6 +930,10 @@ public void generateFindBoundsFromMethod(ClassWriter cw, String className) { // S: [A:[I], I, I] mv.visitInsn(IASTORE); + // Reset lazyFindMode before returning (match found path) + mv.visitVarInsn(ALOAD, 0); + mv.visitInsn(ICONST_0); + mv.visitFieldInsn(PUTFIELD, className, "lazyFindMode", "Z"); // Return start position // S: [] mv.visitVarInsn(ILOAD, posVar); @@ -928,6 +942,10 @@ public void generateFindBoundsFromMethod(ClassWriter cw, String className) { mv.visitLabel(findMatchPositionLoopEnd); // No match found anywhere + // Reset lazyFindMode before returning (no match path) + mv.visitVarInsn(ALOAD, 0); + mv.visitInsn(ICONST_0); + mv.visitFieldInsn(PUTFIELD, className, "lazyFindMode", "Z"); mv.visitInsn(ICONST_M1); mv.visitInsn(IRETURN); @@ -1224,6 +1242,152 @@ public void generateMatchMethod(ClassWriter cw, String className) { mv.visitEnd(); } + /** + * Generate matchInto() method that writes capture boundaries directly into caller-provided + * arrays. Signature: public boolean matchInto(String input, int[] groupStarts, int[] groupEnds) + */ + public void generateMatchIntoMethod(ClassWriter cw, String className) { + MethodVisitor mv = + cw.visitMethod(ACC_PUBLIC, "matchInto", "(Ljava/lang/String;[I[I)Z", null, null); + mv.visitCode(); + + // Local vars: 0=this, 1=input, 2=groupStarts, 3=groupEnds + LocalVarAllocator allocator = new LocalVarAllocator(4); + int groupsVar = allocator.allocate(); + int resultVar = allocator.allocate(); + int iVar = allocator.allocate(); + int requiredGroups = groupCount + 1; + + // Objects.requireNonNull(input/groupStarts/groupEnds) + mv.visitVarInsn(ALOAD, 1); + mv.visitLdcInsn("input"); + mv.visitMethodInsn( + INVOKESTATIC, + "java/util/Objects", + "requireNonNull", + "(Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object;", + false); + mv.visitInsn(POP); + mv.visitVarInsn(ALOAD, 2); + mv.visitLdcInsn("groupStarts"); + mv.visitMethodInsn( + INVOKESTATIC, + "java/util/Objects", + "requireNonNull", + "(Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object;", + false); + mv.visitInsn(POP); + mv.visitVarInsn(ALOAD, 3); + mv.visitLdcInsn("groupEnds"); + mv.visitMethodInsn( + INVOKESTATIC, + "java/util/Objects", + "requireNonNull", + "(Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object;", + false); + mv.visitInsn(POP); + + Label startsLengthOk = new Label(); + mv.visitVarInsn(ALOAD, 2); + mv.visitInsn(ARRAYLENGTH); + BytecodeUtil.pushInt(mv, requiredGroups); + mv.visitJumpInsn(IF_ICMPGE, startsLengthOk); + generateGroupArrayTooSmallThrow(mv, requiredGroups); + mv.visitLabel(startsLengthOk); + + Label endsLengthOk = new Label(); + mv.visitVarInsn(ALOAD, 3); + mv.visitInsn(ARRAYLENGTH); + BytecodeUtil.pushInt(mv, requiredGroups); + mv.visitJumpInsn(IF_ICMPGE, endsLengthOk); + generateGroupArrayTooSmallThrow(mv, requiredGroups); + mv.visitLabel(endsLengthOk); + + // int[] groups = this.recursiveGroups; + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn( + GETFIELD, "com/datadoghq/reggie/runtime/ReggieMatcher", "recursiveGroups", "[I"); + mv.visitVarInsn(ASTORE, groupsVar); + + // Initialize packed groups to -1. Caller arrays remain unchanged until success. + Label initLoopStart = new Label(); + Label initLoopEnd = new Label(); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ISTORE, iVar); + + mv.visitLabel(initLoopStart); + mv.visitVarInsn(ILOAD, iVar); + mv.visitVarInsn(ALOAD, groupsVar); + mv.visitInsn(ARRAYLENGTH); + mv.visitJumpInsn(IF_ICMPGE, initLoopEnd); + mv.visitVarInsn(ALOAD, groupsVar); + mv.visitVarInsn(ILOAD, iVar); + mv.visitInsn(ICONST_M1); + mv.visitInsn(IASTORE); + mv.visitIincInsn(iVar, 1); + mv.visitJumpInsn(GOTO, initLoopStart); + mv.visitLabel(initLoopEnd); + + // int result = parseRoot(input, 0, input.length(), groups, 0) + mv.visitVarInsn(ALOAD, 0); + mv.visitVarInsn(ALOAD, 1); + mv.visitInsn(ICONST_0); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + mv.visitVarInsn(ALOAD, groupsVar); + mv.visitInsn(ICONST_0); + mv.visitMethodInsn(INVOKESPECIAL, className, "parseRoot", "(Ljava/lang/String;II[II)I", false); + mv.visitVarInsn(ISTORE, resultVar); + + // Full-match check. + mv.visitVarInsn(ILOAD, resultVar); + mv.visitVarInsn(ALOAD, 1); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "length", "()I", false); + Label matchFailed = new Label(); + mv.visitJumpInsn(IF_ICMPNE, matchFailed); + + // Copy packed groups into caller arrays: starts[i] = groups[2*i], ends[i] = groups[2*i + 1] + for (int i = 0; i <= groupCount; i++) { + mv.visitVarInsn(ALOAD, 2); + BytecodeUtil.pushInt(mv, i); + mv.visitVarInsn(ALOAD, groupsVar); + BytecodeUtil.pushInt(mv, i * 2); + mv.visitInsn(IALOAD); + mv.visitInsn(IASTORE); + + mv.visitVarInsn(ALOAD, 3); + BytecodeUtil.pushInt(mv, i); + mv.visitVarInsn(ALOAD, groupsVar); + BytecodeUtil.pushInt(mv, i * 2 + 1); + mv.visitInsn(IALOAD); + mv.visitInsn(IASTORE); + } + + mv.visitInsn(ICONST_1); + mv.visitInsn(IRETURN); + + mv.visitLabel(matchFailed); + mv.visitInsn(ICONST_0); + mv.visitInsn(IRETURN); + + mv.visitMaxs(6, allocator.peek()); + mv.visitEnd(); + } + + private void generateGroupArrayTooSmallThrow(MethodVisitor mv, int requiredGroups) { + mv.visitTypeInsn(NEW, "java/lang/IndexOutOfBoundsException"); + mv.visitInsn(DUP); + mv.visitLdcInsn( + "group arrays must have length at least " + requiredGroups + " for this pattern"); + mv.visitMethodInsn( + INVOKESPECIAL, + "java/lang/IndexOutOfBoundsException", + "", + "(Ljava/lang/String;)V", + false); + mv.visitInsn(ATHROW); + } + /** * Generate find method - searches for pattern in the input string. Signature: public boolean * find(String input) @@ -1854,10 +2018,20 @@ public Void visitQuantifier(QuantifierNode node) { mv.visitLabel(minLoopEnd); } - // Match as many as possible up to max - // For non-greedy quantifiers, the preference for fewer matches is handled - // by generateConcatWithBacktracking when followed by more pattern elements. - // When standalone or at the end of a pattern, always match max. + // Lazy (non-greedy) in find() mode: return the minimum match immediately. + // In matches() mode the greedy extension below ensures the full input can be consumed + // when a lazy quantifier is at the end of a concat with no following sibling. + if (!node.greedy) { + Label matchesModeLabel = new Label(); + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn(GETFIELD, className, "lazyFindMode", "Z"); + mv.visitJumpInsn(IFEQ, matchesModeLabel); // lazyFindMode==false → matches mode, continue + mv.visitVarInsn(ILOAD, 6); // currentPos after min matches + mv.visitInsn(IRETURN); + mv.visitLabel(matchesModeLabel); + } + + // Match as many as possible up to max (used by matches() or greedy find) // PCRE semantics: capturing groups should contain values from LAST iteration Label greedyLoopStart = new Label(); Label greedyLoopEnd = new Label(); @@ -2345,8 +2519,16 @@ private void generateConcatWithBacktracking(ConcatNode node, int backtrackChildI mv.visitVarInsn(ILOAD, 5); // currentPos (before update) Label madeProgressGreedy = new Label(); mv.visitJumpInsn(IF_ICMPNE, madeProgressGreedy); - // Empty match: count it but stop looping - mv.visitIincInsn(10, 1); // matchCount++ for this empty match + // Zero-width match: count it. For patterns like \A{3,} the anchor is always zero-width; + // we must keep counting until matchCount reaches min before stopping, otherwise the + // minimum repetition requirement won't be satisfied. + mv.visitIincInsn(10, 1); // matchCount++ + if (quantNode.min > 1) { + // If matchCount is still below min, continue counting (safe: pos doesn't change) + mv.visitVarInsn(ILOAD, 10); // matchCount + BytecodeUtil.pushInt(mv, quantNode.min); + mv.visitJumpInsn(IF_ICMPLT, greedyLoop); // matchCount < min → keep going + } mv.visitJumpInsn(GOTO, greedyEnd); mv.visitLabel(madeProgressGreedy); @@ -2585,7 +2767,23 @@ private void generateConcatWithBacktracking(ConcatNode node, int backtrackChildI node, nestedBacktrackIndex, backtrackLoop, quantNode.greedy ? -1 : 1, 9, 16); } - // All remaining children succeeded + // All remaining children succeeded. + // For a lazy quantifier in matches() mode: if the position hasn't reached the required + // end yet, extend by one more iteration so the full input can be covered. + if (!quantNode.greedy) { + Label returnNow = new Label(); + // If lazyFindMode == true (find mode), return immediately on first success. + mv.visitVarInsn(ALOAD, 0); + mv.visitFieldInsn(GETFIELD, className, "lazyFindMode", "Z"); + mv.visitJumpInsn(IFNE, returnNow); + // Matches mode: if pos < end, try one more quantifier iteration. + mv.visitVarInsn(ILOAD, 5); // currentPos + mv.visitVarInsn(ILOAD, 3); // end + mv.visitJumpInsn(IF_ICMPGE, returnNow); // pos >= end → done + mv.visitIincInsn(9, 1); // tryMatchCount++ (lazy: increment toward max) + mv.visitJumpInsn(GOTO, backtrackLoop); + mv.visitLabel(returnNow); + } mv.visitVarInsn(ILOAD, 5); mv.visitInsn(IRETURN); @@ -3310,16 +3508,40 @@ public Void visitAnchor(AnchorNode node) { mv.visitLabel(atEnd); mv.visitVarInsn(ILOAD, 2); mv.visitInsn(IRETURN); - } else if (node.type == AnchorNode.Type.END - || node.type == AnchorNode.Type.STRING_END_ABSOLUTE) { - // $ (non-multiline) or \z: must be at end of input + } else if (node.type == AnchorNode.Type.STRING_END_ABSOLUTE) { + // \z: strict end of input only mv.visitVarInsn(ILOAD, 2); // pos mv.visitVarInsn(ILOAD, 3); // end - Label atEnd = new Label(); - mv.visitJumpInsn(IF_ICMPEQ, atEnd); + Label atAbsEnd = new Label(); + mv.visitJumpInsn(IF_ICMPEQ, atAbsEnd); mv.visitInsn(ICONST_M1); mv.visitInsn(IRETURN); - mv.visitLabel(atEnd); + mv.visitLabel(atAbsEnd); + mv.visitVarInsn(ILOAD, 2); + mv.visitInsn(IRETURN); + } else if (node.type == AnchorNode.Type.END) { + // $ (non-multiline): same as \Z — pos == end OR (pos == end-1 AND charAt(pos) == '\n') + mv.visitVarInsn(ILOAD, 2); // pos + mv.visitVarInsn(ILOAD, 3); // end + Label dollarOk = new Label(); + mv.visitJumpInsn(IF_ICMPEQ, dollarOk); + // pos != end: check if pos == end-1 AND charAt(pos) == '\n' + mv.visitVarInsn(ILOAD, 2); + mv.visitVarInsn(ILOAD, 3); + mv.visitInsn(ICONST_1); + mv.visitInsn(ISUB); + Label dollarFail = new Label(); + mv.visitJumpInsn(IF_ICMPNE, dollarFail); + mv.visitVarInsn(ALOAD, 1); // input + mv.visitVarInsn(ILOAD, 2); + mv.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String", "charAt", "(I)C", false); + mv.visitIntInsn(BIPUSH, '\n'); + mv.visitJumpInsn(IF_ICMPNE, dollarFail); + mv.visitJumpInsn(GOTO, dollarOk); + mv.visitLabel(dollarFail); + mv.visitInsn(ICONST_M1); + mv.visitInsn(IRETURN); + mv.visitLabel(dollarOk); mv.visitVarInsn(ILOAD, 2); mv.visitInsn(IRETURN); } @@ -3391,8 +3613,8 @@ public Void visitBackreference(BackreferenceNode node) { mv.visitLabel(endIndexUpperBoundsOk); // Check if group has been captured - // PCRE semantics: An uncaptured backreference matches an empty string (0 chars) - // C-03: Also match empty string when group is in partial-open state + // JDK semantics: a backref to a group that never participated fails (returns -1). + // C-03: Match empty string when group is in partial-open state // (groups[startIndex] >= 0 AND groups[endIndex] == -1), which means we are // currently inside that group's first iteration (self-referencing backref). Label groupCaptured = new Label(); @@ -3402,8 +3624,8 @@ public Void visitBackreference(BackreferenceNode node) { mv.visitInsn(ICONST_M1); mv.visitJumpInsn(IF_ICMPNE, groupCaptured); - // groups[startIndex] == -1: group not captured at all - match empty string - mv.visitVarInsn(ILOAD, 2); // pos + // groups[startIndex] == -1: group never captured — fail (JDK semantics) + mv.visitInsn(ICONST_M1); mv.visitInsn(IRETURN); mv.visitLabel(groupCaptured); diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/SWARPatternAnalyzer.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/SWARPatternAnalyzer.java index de2acf3..ff39495 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/SWARPatternAnalyzer.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/SWARPatternAnalyzer.java @@ -85,7 +85,12 @@ public static SWAROptimization analyzeForSWAR(CharSet charset, boolean negated) return new HexDigitOptimization(); } - // Case 4: Multi-range patterns + // Case 4: Multi-range patterns. MultiRangeOptimization only emits correct bytecode for the + // two hand-written shapes [a-zA-Z] and [a-zA-Z0-9]; for any other combination it silently + // falls back to a single-range scan over the first range, which causes find() to miss + // matches that start with characters from the dropped ranges (e.g. `[-_0-9]` finds only + // `-`). Until the general multi-range search is implemented, only opt in for the supported + // shapes here. List ranges = charset.getRanges(); if (!negated && ranges.size() >= 2 && ranges.size() <= 4) { // Calculate total coverage @@ -107,34 +112,55 @@ public static SWAROptimization analyzeForSWAR(CharSet charset, boolean negated) rangeArray[i * 2] = ranges.get(i).start; rangeArray[i * 2 + 1] = ranges.get(i).end; } - return new MultiRangeOptimization(rangeArray); - } - } - - // Case 5: Small literal set (up to 4 discrete characters) - // This would be ranges where each range is a single character - if (!negated && ranges.size() <= 4 && ranges.size() >= 2) { - boolean allSingleChar = true; - for (CharSet.Range range : ranges) { - if (range.start != range.end || range.start > 0xFF) { - allSingleChar = false; - break; - } - } - - if (allSingleChar) { - char[] literals = new char[ranges.size()]; - for (int i = 0; i < ranges.size(); i++) { - literals[i] = ranges.get(i).start; + if (isSupportedMultiRangeShape(rangeArray)) { + return new MultiRangeOptimization(rangeArray); } - return new LiteralSetOptimization(literals); + // Fall through to literal-set case 5 below; otherwise return null at the end. } } + // Case 5: Small literal set — disabled. + // LiteralSetOptimization.generateFindNextBytecode only searches for literals[0], + // causing find() to miss matches starting with any other literal in the set. + // TODO: implement a correct multi-literal scan before re-enabling. + // Default: no optimization, use existing charAt() logic return null; } + /** + * Returns true when the multi-range layout matches one of the two shapes that {@link + * com.datadoghq.reggie.codegen.codegen.swar.MultiRangeOptimization} actually emits correct + * bytecode for: {@code [a-zA-Z]} and {@code [a-zA-Z0-9]} / {@code [0-9a-zA-Z]}. Any other shape + * falls into the generator's first-range-only fallback and silently misses matches. + */ + private static boolean isSupportedMultiRangeShape(char[] ranges) { + if (ranges.length == 4) { + // [a-zA-Z] + return ranges[0] == 'a' && ranges[1] == 'z' && ranges[2] == 'A' && ranges[3] == 'Z'; + } + if (ranges.length == 6) { + // [0-9a-zA-Z] + boolean variant1 = + ranges[0] == '0' + && ranges[1] == '9' + && ranges[2] == 'a' + && ranges[3] == 'z' + && ranges[4] == 'A' + && ranges[5] == 'Z'; + // [a-zA-Z0-9] + boolean variant2 = + ranges[0] == 'a' + && ranges[1] == 'z' + && ranges[2] == 'A' + && ranges[3] == 'Z' + && ranges[4] == '0' + && ranges[5] == '9'; + return variant1 || variant2; + } + return false; + } + /** Check if the charset matches hex digits [0-9a-fA-F]. */ private static boolean isHexDigits(CharSet charset) { List ranges = charset.getRanges(); diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/StatelessLoopBytecodeGenerator.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/StatelessLoopBytecodeGenerator.java index c8a799f..da7e715 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/StatelessLoopBytecodeGenerator.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/StatelessLoopBytecodeGenerator.java @@ -259,6 +259,14 @@ private void generateSingleQuantifierMatches(ClassWriter cw, String className) { mv.visitLabel(loopStart); + // Check upper bound BEFORE consuming: if (maxReps != -1 && count >= maxReps) break. + // Must come before pos < len check so that {0} exits immediately without consuming any char. + if (info.maxReps != -1) { + mv.visitVarInsn(ILOAD, countVar); + pushInt(mv, info.maxReps); + mv.visitJumpInsn(IF_ICMPGE, loopEnd); + } + // Check: pos < len mv.visitVarInsn(ILOAD, posVar); mv.visitVarInsn(ILOAD, lenVar); @@ -280,14 +288,6 @@ private void generateSingleQuantifierMatches(ClassWriter cw, String className) { // count++; mv.visitIincInsn(countVar, 1); - // Check maxReps bound if applicable - if (info.maxReps > 0) { - // if (count >= maxReps) break; - mv.visitVarInsn(ILOAD, countVar); - pushInt(mv, info.maxReps); - mv.visitJumpInsn(IF_ICMPGE, loopEnd); - } - mv.visitJumpInsn(GOTO, loopStart); // Character mismatch - return false @@ -445,6 +445,13 @@ private void generateSingleQuantifierFind(ClassWriter cw, String className) { mv.visitLabel(innerLoop); + // Check upper bound before consuming (fixes maxReps=0 case) + if (info.maxReps != -1) { + mv.visitVarInsn(ILOAD, countVar); + pushInt(mv, info.maxReps); + mv.visitJumpInsn(IF_ICMPGE, innerEnd); + } + // Check: pos < len mv.visitVarInsn(ILOAD, posVar); mv.visitVarInsn(ILOAD, lenVar); @@ -465,14 +472,6 @@ private void generateSingleQuantifierFind(ClassWriter cw, String className) { // count++; mv.visitIincInsn(countVar, 1); - // Check maxReps bound if applicable - if (info.maxReps > 0) { - // if (count >= maxReps) break; - mv.visitVarInsn(ILOAD, countVar); - pushInt(mv, info.maxReps); - mv.visitJumpInsn(IF_ICMPGE, innerEnd); - } - mv.visitJumpInsn(GOTO, innerLoop); // Inner loop end - check if we found a match @@ -748,6 +747,13 @@ private void generateSingleQuantifierFindFrom(ClassWriter cw, String className) mv.visitLabel(innerLoop); + // Check upper bound before consuming (fixes maxReps=0 case) + if (info.maxReps != -1) { + mv.visitVarInsn(ILOAD, countVar); + pushInt(mv, info.maxReps); + mv.visitJumpInsn(IF_ICMPGE, innerEnd); + } + // Check: pos < len mv.visitVarInsn(ILOAD, posVar); mv.visitVarInsn(ILOAD, lenVar); @@ -768,14 +774,6 @@ private void generateSingleQuantifierFindFrom(ClassWriter cw, String className) // count++; mv.visitIincInsn(countVar, 1); - // Check maxReps bound if applicable - if (info.maxReps > 0) { - // if (count >= maxReps) break; - mv.visitVarInsn(ILOAD, countVar); - pushInt(mv, info.maxReps); - mv.visitJumpInsn(IF_ICMPGE, innerEnd); - } - mv.visitJumpInsn(GOTO, innerLoop); // Inner loop end - check if we found a match @@ -1252,6 +1250,13 @@ private void generateSingleQuantifierFindBoundsFrom(ClassWriter cw, String class mv.visitLabel(innerLoop); + // Check upper bound before consuming (fixes maxReps=0 case) + if (info.maxReps != -1) { + mv.visitVarInsn(ILOAD, countVar); + pushInt(mv, info.maxReps); + mv.visitJumpInsn(IF_ICMPGE, innerEnd); + } + // Check: pos < len mv.visitVarInsn(ILOAD, posVar); mv.visitVarInsn(ILOAD, lenVar); @@ -1272,14 +1277,6 @@ private void generateSingleQuantifierFindBoundsFrom(ClassWriter cw, String class // count++; mv.visitIincInsn(countVar, 1); - // Check maxReps bound if applicable - if (info.maxReps > 0) { - // if (count >= maxReps) break; - mv.visitVarInsn(ILOAD, countVar); - pushInt(mv, info.maxReps); - mv.visitJumpInsn(IF_ICMPGE, innerEnd); - } - mv.visitJumpInsn(GOTO, innerLoop); // Inner loop end - check if we found a match @@ -1618,9 +1615,16 @@ public void generateFindMatchFromMethod(ClassWriter cw, String className) { // Compute matchEnd based on pattern type if (info.type == StatelessPatternInfo.PatternType.SINGLE_QUANTIFIER) { - // For simple quantifier, find where the character class stops matching + // For simple quantifier, find where the character class stops matching, capped at + // info.maxReps when a finite upper bound exists. Without this cap the greedy scan + // would consume every charset-matching character past the quantifier's upper bound, + // making {5}, {5,7}, {5,99} all behave like {5,}. The cap mirrors the bound check + // already enforced in findFrom / findBoundsFrom for this pattern shape. // matchEnd = matchStart; - // while (matchEnd < input.length() && charsetMatches(input.charAt(matchEnd))) matchEnd++; + // while (matchEnd < input.length() + // && (maxReps < 0 || matchEnd - matchStart < maxReps) + // && charsetMatches(input.charAt(matchEnd))) + // matchEnd++; mv.visitVarInsn(ILOAD, matchStartVar); mv.visitVarInsn(ISTORE, matchEndVar); @@ -1629,6 +1633,17 @@ public void generateFindMatchFromMethod(ClassWriter cw, String className) { Label loopEnd = new Label(); mv.visitLabel(loopStart); + + // Enforce upper bound BEFORE consuming: if maxReps != -1 and (matchEnd - matchStart) >= + // maxReps, break. + if (info.maxReps != -1) { + mv.visitVarInsn(ILOAD, matchEndVar); + mv.visitVarInsn(ILOAD, matchStartVar); + mv.visitInsn(ISUB); + pushInt(mv, info.maxReps); + mv.visitJumpInsn(IF_ICMPGE, loopEnd); + } + // if (matchEnd >= input.length()) break; mv.visitVarInsn(ILOAD, matchEndVar); mv.visitVarInsn(ALOAD, inputVar); diff --git a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/parsing/RegexParser.java b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/parsing/RegexParser.java index 973bead..3108638 100644 --- a/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/parsing/RegexParser.java +++ b/reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/parsing/RegexParser.java @@ -287,6 +287,11 @@ private RegexNode parseGroup() throws ParseException { } else if (peek() == '|') { // Branch reset: (?|alt1|alt2) return parseBranchReset(); + } else if (peek() == '>') { + // Atomic group: (?>X). Reggie has no backtracking in its DFA/NFA engines, so the + // backtracking-prevention hint has the same language semantics as a non-capturing group. + consume(); + capturing = false; } else { throw new UnsupportedPatternException( "Unsupported special group construct at position " + pos); @@ -375,6 +380,13 @@ private RegexNode parseCharClass() throws ParseException { } } + if (peek() == '\\' && pos + 1 < pattern.length() && pattern.charAt(pos + 1) == 'Q') { + consume(); // consume '\\' + consume(); // consume 'Q' + ranges.addAll(parseQuotedCharClassRanges()); + continue; + } + char start = parseCharClassChar(); if (peek() == '-' && peekNext() != ']') { @@ -397,6 +409,19 @@ private RegexNode parseCharClass() throws ParseException { return new CharClassNode(charset, negated); } + private List parseQuotedCharClassRanges() { + List quotedRanges = new ArrayList<>(); + while (hasMore()) { + char ch = consume(); + if (ch == '\\' && hasMore() && peek() == 'E') { + consume(); // consume 'E' + break; + } + quotedRanges.add(new CharSet.Range(ch, ch)); + } + return quotedRanges; + } + private CharSet getCharSetForEscape(char escapeChar) { switch (escapeChar) { case 'd': @@ -522,6 +547,9 @@ private RegexNode parseEscape() throws ParseException { case 'g': // PCRE backreference: \g{N}, \g{-N}, or \g{name} return parseGBackreference(); + case 'Q': + // Quoted literal: \Q...\E + return parseQuotedLiteral(); default: // Escaped literal (e.g., \., \*, \+) return new LiteralNode(ch); @@ -715,6 +743,27 @@ private RegexNode parseGBackreference() throws ParseException { } } + /** + * Parse a quoted literal sequence: \Q...\E. Consumes all characters until \E (or end of pattern) + * and returns them as literal AST nodes. + */ + private RegexNode parseQuotedLiteral() { + List parts = new ArrayList<>(); + while (hasMore()) { + char ch = consume(); + if (ch == '\\' && hasMore() && peek() == 'E') { + consume(); // consume 'E' + break; + } + parts.add(new LiteralNode(ch)); + } + + if (parts.isEmpty()) { + return new LiteralNode((char) 0); + } + return parts.size() == 1 ? parts.get(0) : new ConcatNode(parts); + } + private RegexNode parseAnchor() throws ParseException { char ch = consume(); boolean isMultiline = currentModifiers.isMultiline(); diff --git a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetectorTest.java b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetectorTest.java index e872fc6..0fe9a66 100644 --- a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetectorTest.java +++ b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetectorTest.java @@ -62,10 +62,10 @@ void alternationInLookbehindNoFallback() throws Exception { assertNull(detect("(?<=a|b)c")); } - // ── Bug-5 regression: combined lookbehind + lookahead must still trigger fallback ────────── + // ── Bug-5 fixed: combined lookbehind + lookahead no longer triggers blanket fallback ──────── @Test - void lookbehindAndLookaheadCombinedTriggersFallback() throws Exception { - assertNotNull(detect("(?<=\\d)[a-z]+(?=\\s)")); + void lookbehindAndLookaheadCombinedNoFallback() throws Exception { + assertNull(detect("(?<=\\d)[a-z]+(?=\\s)")); } } diff --git a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlanTest.java b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlanTest.java new file mode 100644 index 0000000..6a08261 --- /dev/null +++ b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlanTest.java @@ -0,0 +1,74 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.codegen.analysis; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +import com.datadoghq.reggie.codegen.ast.RegexNode; +import com.datadoghq.reggie.codegen.parsing.RegexParser; +import java.util.List; +import org.junit.jupiter.api.Test; + +class LinearTokenSequencePlanTest { + + @Test + void buildsPlanWithOriginalCaptureNumbers() throws Exception { + LinearTokenSequencePlan plan = planFor("(?\\S+) (?[+-]?\\d+)"); + + assertEquals(2, plan.groupCount()); + assertEquals( + List.of( + LinearTokenSequencePlan.OpKind.CAPTURE_NON_SPACE, + LinearTokenSequencePlan.OpKind.LITERAL, + LinearTokenSequencePlan.OpKind.CAPTURE_SIGNED_INTEGER), + plan.ops().stream().map(LinearTokenSequencePlan.Op::kind).toList()); + assertEquals(1, plan.ops().get(0).groupNumber()); + assertEquals(2, plan.ops().get(2).groupNumber()); + } + + @Test + void foldsQuotedDelimiterCaptureIntoSinglePlanOp() throws Exception { + LinearTokenSequencePlan plan = planFor("prefix=\"(?[^\"]*)\" suffix"); + + assertEquals( + List.of( + LinearTokenSequencePlan.OpKind.LITERAL, + LinearTokenSequencePlan.OpKind.CAPTURE_QUOTED_UNTIL_DELIMITER, + LinearTokenSequencePlan.OpKind.LITERAL), + plan.ops().stream().map(LinearTokenSequencePlan.Op::kind).toList()); + assertEquals("prefix=", plan.ops().get(0).literal()); + assertEquals(1, plan.ops().get(1).groupNumber()); + assertEquals('"', plan.ops().get(1).delimiter()); + assertEquals(" suffix", plan.ops().get(2).literal()); + } + + @Test + void failsClosedForGeneralRegexCategories() throws Exception { + PatternCategorization categorization = categorize("(?\\w+)\\s+\\1"); + + assertTrue(LinearTokenSequencePlan.from(categorization).isEmpty()); + } + + private static LinearTokenSequencePlan planFor(String pattern) throws Exception { + return LinearTokenSequencePlan.from(categorize(pattern)).orElseThrow(); + } + + private static PatternCategorization categorize(String pattern) throws Exception { + RegexNode ast = new RegexParser().parse(pattern); + return PatternCategorizer.categorize(ast); + } +} diff --git a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternCategorizerTest.java b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternCategorizerTest.java new file mode 100644 index 0000000..d13f16c --- /dev/null +++ b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternCategorizerTest.java @@ -0,0 +1,105 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.codegen.analysis; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +import com.datadoghq.reggie.codegen.ast.RegexNode; +import com.datadoghq.reggie.codegen.parsing.RegexParser; +import java.util.List; +import org.junit.jupiter.api.Test; + +class PatternCategorizerTest { + + @Test + void categorizesLinearDelimitedLogTemplateWithoutGrokNames() throws Exception { + String pattern = + "(?(?:[0-9]{1,3}\\.){3}[0-9]{1,3}|[A-Za-z0-9.-]+) " + + "(?\\S+) " + + "(?\\S+) " + + "\\[(?[^\\]]+)\\]\\s+" + + "\"(?\\b\\w+\\b) (?\\S+) HTTP/(?\\d+\\.\\d+)\" " + + "(?[+-]?\\d+) " + + "(?[+-]?\\d+) " + + "\"(?[^\"]*)\" " + + "(?[+-]?\\d+(?:\\.\\d+)?)" + + ".* \\[(?\\b\\w+\\b)\\] .*"; + + PatternCategorization categorization = categorize(pattern); + + assertEquals(PatternCategorization.Category.LINEAR_TOKEN_SEQUENCE, categorization.category()); + assertTrue(categorization.notes().stream().noneMatch(note -> note.contains("grok"))); + + List capturedKinds = + categorization.atoms().stream() + .filter(PatternAtom::isCaptured) + .map(PatternAtom::kind) + .toList(); + assertEquals( + List.of( + PatternAtom.Kind.IP_OR_HOST, + PatternAtom.Kind.NON_SPACE_PLUS, + PatternAtom.Kind.NON_SPACE_PLUS, + PatternAtom.Kind.UNTIL_DELIMITER, + PatternAtom.Kind.WORD, + PatternAtom.Kind.NON_SPACE_PLUS, + PatternAtom.Kind.DECIMAL_NUMBER, + PatternAtom.Kind.SIGNED_INTEGER, + PatternAtom.Kind.SIGNED_INTEGER, + PatternAtom.Kind.UNTIL_DELIMITER, + PatternAtom.Kind.SIGNED_DECIMAL_NUMBER, + PatternAtom.Kind.BRACKETED_WORD_AFTER_SKIP), + capturedKinds); + + assertTrue( + categorization.atoms().stream() + .anyMatch( + atom -> + atom.kind() == PatternAtom.Kind.UNTIL_DELIMITER + && "timestamp".equals(atom.groupName()) + && atom.delimiter() == ']')); + assertTrue( + categorization.atoms().stream() + .anyMatch( + atom -> + atom.kind() == PatternAtom.Kind.UNTIL_DELIMITER + && "referer".equals(atom.groupName()) + && atom.delimiter() == '"')); + } + + @Test + void rejectsBacktrackingDependentShapes() throws Exception { + PatternCategorization categorization = categorize("(?\\w+)\\s+\\1"); + + assertEquals(PatternCategorization.Category.GENERAL_REGEX, categorization.category()); + assertTrue(categorization.notes().stream().anyMatch(note -> note.contains("backreference"))); + } + + @Test + void rejectsUnsupportedControlFlowShapes() throws Exception { + for (String pattern : List.of("(?\\S+?)-end", "(?=prefix)(?\\S+)")) { + PatternCategorization categorization = categorize(pattern); + + assertEquals(PatternCategorization.Category.GENERAL_REGEX, categorization.category()); + } + } + + private static PatternCategorization categorize(String pattern) throws Exception { + RegexNode ast = new RegexParser().parse(pattern); + return PatternCategorizer.categorize(ast); + } +} diff --git a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternRoutingPropertyTest.java b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternRoutingPropertyTest.java index fbcc31d..b77c54a 100644 --- a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternRoutingPropertyTest.java +++ b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternRoutingPropertyTest.java @@ -218,9 +218,9 @@ static Stream provideDFAExamples() { // DFA_SWITCH (20-300 states) new PatternRoutingTestCase("(a|b|c){50}", DFA_SWITCH, "medium alternation (151 states)"), - // DFA_TABLE (>300 states) + // Large DFA state spaces fall back to NFA instead of generating oversized DFA bytecode. new PatternRoutingTestCase( - "(a|b|c|d|e|f){100}", DFA_TABLE, "very high repetition alternation (601 states)")); + "(a|b|c|d|e|f){100}", OPTIMIZED_NFA, "very high repetition alternation (601 states)")); } @Nested diff --git a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/StrategySelectionTest.java b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/StrategySelectionTest.java index c593650..95a1096 100644 --- a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/StrategySelectionTest.java +++ b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/StrategySelectionTest.java @@ -49,6 +49,31 @@ private PatternAnalyzer.MatchingStrategyResult analyze(String pattern) throws Ex } } + @Test + void largePureDFAUsesTableStrategy() throws Exception { + PatternAnalyzer.MatchingStrategyResult result = analyze("(?:[a-z][0-9]){150}"); + + assertEquals( + PatternAnalyzer.MatchingStrategy.DFA_TABLE, + result.strategy, + "Large pure regular DFAs should use the compact table backend"); + } + + @Test + void negatedCharClassDelimitedCapturesDoNotRequireRecursiveDescent() throws Exception { + RegexParser parser = new RegexParser(); + RegexNode ast = parser.parse("([^ ]+) ([^\" ]*) (HTTP/\\d\\.\\d)"); + NFA nfa = new ThompsonBuilder().build(ast, 2); + + PatternAnalyzer.MatchingStrategyResult result = + new PatternAnalyzer(ast, nfa).analyzeAndRecommend(); + + assertNotEquals( + PatternAnalyzer.MatchingStrategy.RECURSIVE_DESCENT, + result.strategy, + "Negated classes that exclude their following delimiter do not need backtracking"); + } + // ==================== Subroutine Tests ==================== @Test diff --git a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/pbt/PatternRoutingPropertyBasedTest.java b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/pbt/PatternRoutingPropertyBasedTest.java index 0ed7b34..888496d 100644 --- a/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/pbt/PatternRoutingPropertyBasedTest.java +++ b/reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/pbt/PatternRoutingPropertyBasedTest.java @@ -124,18 +124,18 @@ void optionalGroupBackrefsDetectedCorrectly(@ForAll("withOptionalGroupBackrefs") } @Property(tries = 50) // Fewer tries since these are expensive patterns - void largeStateSpacePatternsUseDFATableOrSpecialized(@ForAll("largeStateSpace") String pattern) { + void largeStateSpacePatternsUseNfaFallbackOrSpecialized( + @ForAll("largeStateSpace") String pattern) { PatternAnalyzer.MatchingStrategyResult result = analyze(pattern); - // Patterns with many states should use DFA (SWITCH or TABLE) or specialized strategy - // Note: (a|b|c){50} = 151 states → DFA_SWITCH - // (a|b|c|d|e|f){100} = 601 states → DFA_TABLE or SPECIALIZED_QUANTIFIED_GROUP + // Patterns with many states should use switch-sized DFA, an NFA fallback, or a specialized + // strategy. Note: (a|b|c){50} = 151 states → DFA_SWITCH; + // (a|b|c|d|e|f){100} = 601 states → OPTIMIZED_NFA or SPECIALIZED_QUANTIFIED_GROUP. List validStrategies = List.of( - DFA_SWITCH, // 50-300 states - DFA_TABLE, // >300 states + DFA_SWITCH, // medium state count SPECIALIZED_QUANTIFIED_GROUP, // Might have specialized strategy - OPTIMIZED_NFA // Rare fallback + OPTIMIZED_NFA // Large state-space fallback ); assertTrue( @@ -143,7 +143,7 @@ void largeStateSpacePatternsUseDFATableOrSpecialized(@ForAll("largeStateSpace") () -> "Large state space pattern: '" + pattern - + "' should use DFA_SWITCH/DFA_TABLE/SPECIALIZED_QUANTIFIED_GROUP/OPTIMIZED_NFA, got: " + + "' should use DFA_SWITCH/SPECIALIZED_QUANTIFIED_GROUP/OPTIMIZED_NFA, got: " + result.strategy); } @@ -247,7 +247,6 @@ private boolean isGenericDFAStrategy(PatternAnalyzer.MatchingStrategy strategy) || strategy == DFA_SWITCH || strategy == DFA_SWITCH_WITH_GROUPS || strategy == DFA_SWITCH_WITH_ASSERTIONS - || strategy == DFA_TABLE || strategy == OPTIMIZED_NFA; } diff --git a/reggie-integration-tests/build.gradle b/reggie-integration-tests/build.gradle index c0ca4d8..8c4122b 100644 --- a/reggie-integration-tests/build.gradle +++ b/reggie-integration-tests/build.gradle @@ -30,6 +30,8 @@ test { events "passed", "skipped", "failed" showStandardStreams = true } + // Forward all -Dreggie.* system properties from the Gradle JVM to the test JVM + systemProperties System.properties.findAll { k, v -> k.toString().startsWith("reggie.") } } // Custom task to run correctness validation standalone diff --git a/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/FuzzRunner.java b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/FuzzRunner.java new file mode 100644 index 0000000..d891a29 --- /dev/null +++ b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/FuzzRunner.java @@ -0,0 +1,109 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.integration.fuzz; + +import com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle.Finding; +import com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle.Result; +import java.util.ArrayList; +import java.util.List; +import java.util.Random; + +/** + * Driver that pairs a {@link RandomRegexGenerator} with {@link RandomInputGenerator} and runs each + * (pattern, input) through {@link RegexFuzzOracle}. Reports aggregated stats and a deduped list of + * findings. + */ +public final class FuzzRunner { + + public static final class Report { + public final int patternsTried; + public final int patternsSkipped; + public final int inputsChecked; + public final List findings; + + public Report(int patternsTried, int patternsSkipped, int inputsChecked, List f) { + this.patternsTried = patternsTried; + this.patternsSkipped = patternsSkipped; + this.inputsChecked = inputsChecked; + this.findings = f; + } + + public String summary() { + return String.format( + "patterns=%d skipped=%d inputs-checked=%d findings=%d", + patternsTried, patternsSkipped, inputsChecked, findings.size()); + } + } + + /** Builder-style config so test methods can override defaults without long argument lists. */ + public static final class Config { + public long seed = 0xC0DEFEED_DEADBEEFL; + public int patternCount = 500; + public int inputsPerPattern = 8; + public int patternDepth = 3; + public int inputMaxLength = 12; + + /** Cap the number of findings retained per pattern to avoid quadratic-style log explosions. */ + public int findingsPerPatternCap = 3; + } + + public Report run(Config cfg) { + Random patternRng = new Random(cfg.seed); + Random inputRng = new Random(cfg.seed ^ 0x9E3779B97F4A7C15L); + + RandomRegexGenerator regexGen = new RandomRegexGenerator(patternRng, cfg.patternDepth); + RandomInputGenerator inputGen = new RandomInputGenerator(inputRng, cfg.inputMaxLength); + RegexFuzzOracle oracle = new RegexFuzzOracle(); + + int skipped = 0; + int inputs = 0; + List findings = new ArrayList<>(); + + for (int p = 0; p < cfg.patternCount; p++) { + String pattern = regexGen.generate(); + int findingsThisPattern = 0; + boolean patternSkipped = false; + + for (int i = 0; i < cfg.inputsPerPattern; i++) { + String input = inputGen.generate(); + Result result = oracle.check(pattern, input); + + if (result.skipped) { + // Most "skipped" reasons are pattern-level (compile-time rejection from either engine); + // bail on the remaining inputs for this pattern when that's the case. + if (i == 0) { + patternSkipped = true; + break; + } + // Mid-iteration skip — e.g. a runtime throw on a specific input. Record once. + break; + } + inputs++; + + for (Finding f : result.findings) { + if (findingsThisPattern < cfg.findingsPerPatternCap) { + findings.add(f); + findingsThisPattern++; + } + } + } + + if (patternSkipped) skipped++; + } + + return new Report(cfg.patternCount - skipped, skipped, inputs, findings); + } +} diff --git a/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomInputGenerator.java b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomInputGenerator.java new file mode 100644 index 0000000..466c1c3 --- /dev/null +++ b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomInputGenerator.java @@ -0,0 +1,49 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.integration.fuzz; + +import java.util.Random; + +/** + * Generates input strings to feed to a pattern. Uses the same small alphabet as {@link + * RandomRegexGenerator} so generated patterns and inputs have a chance of interacting. + */ +public final class RandomInputGenerator { + + private static final char[] ALPHABET = {'a', 'b', 'c', '0', '1', '-', '_', '\n'}; + + private final Random rnd; + private final int maxLength; + + /** + * @param rnd random source. + * @param maxLength inclusive maximum input length. Strings up to this length are sampled; the + * empty string is included. + */ + public RandomInputGenerator(Random rnd, int maxLength) { + this.rnd = rnd; + this.maxLength = maxLength; + } + + public String generate() { + int len = rnd.nextInt(maxLength + 1); + StringBuilder sb = new StringBuilder(len); + for (int i = 0; i < len; i++) { + sb.append(ALPHABET[rnd.nextInt(ALPHABET.length)]); + } + return sb.toString(); + } +} diff --git a/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomRegexGenerator.java b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomRegexGenerator.java new file mode 100644 index 0000000..bfedb72 --- /dev/null +++ b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomRegexGenerator.java @@ -0,0 +1,193 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.integration.fuzz; + +import java.util.Random; + +/** + * Grammar-driven random regex generator. Produces syntactically valid patterns over a small + * alphabet, bounded by a depth parameter, suitable for property-based testing against an oracle + * (typically {@link java.util.regex.Pattern}). Deterministic given a seed. + * + *

Scope is deliberately narrow: simple atoms (literal chars, character classes), quantifiers, + * concat/alternation, anchors, groups, and a single backreference when a group is in scope. No + * lookarounds, no PCRE-only features — those have their own existing test corpora and would widen + * the surface beyond what we can usefully oracle-check. + */ +public final class RandomRegexGenerator { + + /** Characters available to literals and character classes. Small alphabet keeps inputs dense. */ + private static final char[] ALPHABET = {'a', 'b', 'c', '0', '1', '-', '_'}; + + private final Random rnd; + private final int maxDepth; + + /** + * @param rnd the random source; pass a seeded {@link Random} for reproducibility. + * @param maxDepth maximum recursion depth for nested groups/alternation/concat. 3 is a sensible + * default — deeper trees produce patterns that strain the JDK parser more than they reveal + * Reggie bugs. + */ + public RandomRegexGenerator(Random rnd, int maxDepth) { + this.rnd = rnd; + this.maxDepth = maxDepth; + } + + /** Generate a top-level pattern. */ + public String generate() { + StringBuilder sb = new StringBuilder(); + // 0 groups in scope yet; the genAlt path opens groups as it recurses. + genAlt(sb, maxDepth, 0); + return sb.toString(); + } + + /** Alternation: child[|child]* */ + private int genAlt(StringBuilder sb, int depth, int groupsInScope) { + int branches = depth <= 0 ? 1 : 1 + rnd.nextInt(2); // 1 or 2 branches when depth allows + for (int i = 0; i < branches; i++) { + if (i > 0) sb.append('|'); + groupsInScope = genConcat(sb, depth - 1, groupsInScope); + } + return groupsInScope; + } + + /** Concatenation: 1-3 atoms in sequence. */ + private int genConcat(StringBuilder sb, int depth, int groupsInScope) { + int parts = 1 + rnd.nextInt(3); + for (int i = 0; i < parts; i++) { + groupsInScope = genAtom(sb, depth, groupsInScope); + } + return groupsInScope; + } + + /** + * A single atom, possibly with a quantifier. Atoms: literal, char class, dot, anchor, group, + * backreference. Depth gates whether group/backref can recurse. + */ + private int genAtom(StringBuilder sb, int depth, int groupsInScope) { + int kind = rnd.nextInt(100); + // Probabilities are eyeballed to keep generated patterns mostly satisfiable. + if (kind < 30) { + sb.append(literal()); + } else if (kind < 50) { + sb.append(charClass()); + } else if (kind < 60) { + sb.append('.'); + } else if (kind < 75) { + sb.append(anchor()); + } else if (depth > 0 && kind < 90) { + // Group: capturing 70% of the time, non-capturing 30%. + // Children inside the group can reference outer groups (groupsInScope) but NOT this + // group itself — a backref to a group that encloses the backref site is semantically + // pathological (the group hasn't captured yet), and JDK / Reggie disagree on its + // meaning. The fuzz oracle treats this as accepted divergence; stop generating it. + boolean capturing = rnd.nextInt(10) < 7; + sb.append(capturing ? "(" : "(?:"); + int after = genAlt(sb, depth - 1, groupsInScope); + sb.append(')'); + // A capturing group adds one to the count from the caller's perspective. + groupsInScope = capturing ? Math.max(groupsInScope + 1, after) : after; + } else if (groupsInScope > 0 && kind < 95) { + // Backreference to an already-opened group. + int target = 1 + rnd.nextInt(groupsInScope); + sb.append('\\').append(target); + } else { + // Fallback: another literal. + sb.append(literal()); + } + // Apply a quantifier sometimes. Anchors and backrefs at the end already wrote their own + // representation; quantifying them is legal in Java regex syntax even if degenerate. + int qkind = rnd.nextInt(10); + if (qkind < 4) { + sb.append(quantifier()); + } + return groupsInScope; + } + + /** A single literal char from {@link #ALPHABET}, regex-escaped where necessary. */ + private String literal() { + char c = ALPHABET[rnd.nextInt(ALPHABET.length)]; + // '-' is fine outside character classes; '_' too. Nothing in our alphabet needs escaping + // at the top level, but be explicit so a future alphabet expansion does not silently break. + return String.valueOf(c); + } + + /** A character class like {@code [abc]} / {@code [a-c]} / {@code [^abc]}. */ + private String charClass() { + StringBuilder cls = new StringBuilder("["); + if (rnd.nextInt(4) == 0) cls.append('^'); // negated 25% of the time + int items = 1 + rnd.nextInt(3); + for (int i = 0; i < items; i++) { + if (rnd.nextInt(3) == 0) { + // Range + char a = ALPHABET[rnd.nextInt(ALPHABET.length)]; + char b = ALPHABET[rnd.nextInt(ALPHABET.length)]; + char low = (char) Math.min(a, b); + char high = (char) Math.max(a, b); + cls.append(low).append('-').append(high); + } else { + cls.append(ALPHABET[rnd.nextInt(ALPHABET.length)]); + } + } + cls.append(']'); + return cls.toString(); + } + + /** A zero-width anchor: ^, $, \A, \Z, \z. */ + private String anchor() { + switch (rnd.nextInt(5)) { + case 0: + return "^"; + case 1: + return "$"; + case 2: + return "\\A"; + case 3: + return "\\Z"; + default: + return "\\z"; + } + } + + /** A quantifier suffix: ?, *, +, {n}, {n,}, {n,m}, with an optional lazy modifier. */ + private String quantifier() { + String base; + switch (rnd.nextInt(6)) { + case 0: + base = "?"; + break; + case 1: + base = "*"; + break; + case 2: + base = "+"; + break; + case 3: + base = "{" + rnd.nextInt(5) + "}"; + break; + case 4: + base = "{" + rnd.nextInt(4) + ",}"; + break; + default: + int lo = rnd.nextInt(4); + int hi = lo + rnd.nextInt(4); + base = "{" + lo + "," + hi + "}"; + break; + } + // 20% chance to make it lazy. + return rnd.nextInt(5) == 0 ? base + "?" : base; + } +} diff --git a/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzOracle.java b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzOracle.java new file mode 100644 index 0000000..c79df4d --- /dev/null +++ b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzOracle.java @@ -0,0 +1,173 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.integration.fuzz; + +import com.datadoghq.reggie.Reggie; +import com.datadoghq.reggie.runtime.MatchResult; +import com.datadoghq.reggie.runtime.ReggieMatcher; +import java.util.ArrayList; +import java.util.List; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import java.util.regex.PatternSyntaxException; + +/** + * Compares Reggie's matching behaviour against {@link java.util.regex.Pattern} for a single pattern + * + input pair. The JDK is the oracle; any divergence is reported as a {@link Finding}. + * + *

Reasons for non-comparable cases (skipped, not failed): the JDK rejects the pattern, Reggie + * rejects the pattern, or either engine throws at match time. Only well-typed inputs produce a + * comparison. + */ +public final class RegexFuzzOracle { + + /** A divergence between Reggie and the JDK. Self-contained so it can be logged or rerun. */ + public static final class Finding { + public final String pattern; + public final String input; + public final String description; + + public Finding(String pattern, String input, String description) { + this.pattern = pattern; + this.input = input; + this.description = description; + } + + @Override + public String toString() { + return String.format("pattern=%s input=%s: %s", escape(pattern), escape(input), description); + } + } + + /** Outcome of running the oracle on a single (pattern, input) pair. */ + public static final class Result { + public final boolean skipped; + public final String skipReason; // non-null when skipped + public final List findings; + + private Result(boolean skipped, String skipReason, List findings) { + this.skipped = skipped; + this.skipReason = skipReason; + this.findings = findings; + } + + static Result skipped(String reason) { + return new Result(true, reason, List.of()); + } + + static Result ran(List findings) { + return new Result(false, null, findings); + } + } + + /** + * Run the comparison. Returns a {@link Result} carrying any divergences. Never throws; if + * compilation or matching blows up unexpectedly in either engine the pair is skipped. + */ + public Result check(String pattern, String input) { + Pattern jdk; + try { + jdk = Pattern.compile(pattern); + } catch (PatternSyntaxException e) { + return Result.skipped("JDK rejected pattern: " + e.getDescription()); + } + + ReggieMatcher reggie; + try { + reggie = Reggie.compile(pattern); + } catch (Throwable t) { + return Result.skipped( + "Reggie rejected pattern: " + t.getClass().getSimpleName() + ": " + t.getMessage()); + } + List findings = new ArrayList<>(); + + // matches() — anchored full-input match + try { + boolean jdkMatches = jdk.matcher(input).matches(); + boolean reggieMatches = reggie.matches(input); + if (jdkMatches != reggieMatches) { + findings.add( + new Finding( + pattern, + input, + String.format("matches() differs: jdk=%s reggie=%s", jdkMatches, reggieMatches))); + } + } catch (Throwable t) { + return Result.skipped("matches() threw: " + t); + } + + // findMatch() — leftmost match + try { + Matcher jm = jdk.matcher(input); + boolean jdkFound = jm.find(); + MatchResult rm = reggie.findMatch(input); + boolean reggieFound = rm != null; + if (jdkFound != reggieFound) { + findings.add( + new Finding( + pattern, + input, + String.format("find() boolean differs: jdk=%s reggie=%s", jdkFound, reggieFound))); + } else if (jdkFound) { + // Spans must agree. + if (jm.start() != rm.start() || jm.end() != rm.end()) { + findings.add( + new Finding( + pattern, + input, + String.format( + "first-match span differs: jdk=[%d,%d) reggie=[%d,%d)", + jm.start(), jm.end(), rm.start(), rm.end()))); + } + } + } catch (Throwable t) { + return Result.skipped("find() threw: " + t); + } + + return Result.ran(findings); + } + + private static String escape(String s) { + StringBuilder sb = new StringBuilder("\""); + for (char c : s.toCharArray()) { + switch (c) { + case '\n': + sb.append("\\n"); + break; + case '\r': + sb.append("\\r"); + break; + case '\t': + sb.append("\\t"); + break; + case '"': + sb.append("\\\""); + break; + case '\\': + sb.append("\\\\"); + break; + default: + if (c < 0x20 || c > 0x7e) { + sb.append(String.format("\\u%04x", (int) c)); + } else { + sb.append(c); + } + } + } + sb.append('"'); + return sb.toString(); + } +} diff --git a/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzShrinker.java b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzShrinker.java new file mode 100644 index 0000000..fb9ec8f --- /dev/null +++ b/reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzShrinker.java @@ -0,0 +1,110 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.integration.fuzz; + +import com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle.Finding; +import com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle.Result; + +/** + * Reduces a divergent (pattern, input) pair to a smaller pair that still diverges between Reggie + * and JDK. The shrinker is dumb on purpose — single-char deletions, iterated to a fixpoint — which + * is enough to take 30-char findings down to 4-6 chars in a few hundred milliseconds. + * + *

"Still diverges" is defined as: oracle returns at least one finding whose description + * starts with the same kind as the original (e.g. {@code "matches() differs"}, {@code "find() + * boolean differs"}, {@code "first-match span differs"}). We deliberately ignore the specific + * numeric span — the shrunk pattern matches a different input, so spans will differ. + */ +public final class RegexFuzzShrinker { + + private final RegexFuzzOracle oracle = new RegexFuzzOracle(); + + /** Result of shrinking. Always returns a valid divergent pair. */ + public static final class Shrunk { + public final String pattern; + public final String input; + public final String findingKind; + + public Shrunk(String pattern, String input, String findingKind) { + this.pattern = pattern; + this.input = input; + this.findingKind = findingKind; + } + } + + public Shrunk shrink(Finding original) { + String kind = findingKind(original.description); + String pattern = original.pattern; + String input = original.input; + + boolean changed = true; + while (changed) { + changed = false; + // Try shrinking the input first (always safe to delete chars). + final String capturedPattern = pattern; + String shorterInput = + tryShrinkString(input, s -> stillDivergesSameKind(capturedPattern, s, kind)); + if (!shorterInput.equals(input)) { + input = shorterInput; + changed = true; + } + // Then shrink the pattern. Deleting a char may produce something the JDK rejects; + // tryShrinkString re-runs the oracle which handles that gracefully. + final String capturedInput = input; + String shorterPattern = + tryShrinkString(pattern, p -> stillDivergesSameKind(p, capturedInput, kind)); + if (!shorterPattern.equals(pattern)) { + pattern = shorterPattern; + changed = true; + } + } + return new Shrunk(pattern, input, kind); + } + + /** + * Greedy single-pass: try deleting each char left-to-right; keep deletions that still diverge. + */ + private static String tryShrinkString(String s, java.util.function.Predicate stillBad) { + StringBuilder cur = new StringBuilder(s); + int i = 0; + while (i < cur.length()) { + char removed = cur.charAt(i); + cur.deleteCharAt(i); + if (stillBad.test(cur.toString())) { + // Keep the deletion; do NOT advance i — the char at position i is new. + } else { + cur.insert(i, removed); + i++; + } + } + return cur.toString(); + } + + private boolean stillDivergesSameKind(String pattern, String input, String kind) { + Result r = oracle.check(pattern, input); + if (r.skipped) return false; + for (Finding f : r.findings) { + if (findingKind(f.description).equals(kind)) return true; + } + return false; + } + + /** First word(s) of a finding description, used as the equivalence class for shrinking. */ + private static String findingKind(String description) { + int colon = description.indexOf(':'); + return colon < 0 ? description : description.substring(0, colon); + } +} diff --git a/reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/AlgorithmicFuzzTest.java b/reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/AlgorithmicFuzzTest.java new file mode 100644 index 0000000..8bc969a --- /dev/null +++ b/reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/AlgorithmicFuzzTest.java @@ -0,0 +1,124 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.integration; + +import static org.junit.jupiter.api.Assertions.assertTrue; + +import com.datadoghq.reggie.integration.fuzz.FuzzRunner; +import com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle.Finding; +import com.datadoghq.reggie.integration.fuzz.RegexFuzzShrinker; +import com.datadoghq.reggie.integration.fuzz.RegexFuzzShrinker.Shrunk; +import java.util.LinkedHashMap; +import java.util.Map; +import java.util.concurrent.TimeUnit; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.Timeout; + +/** + * Grammar-driven fuzz test cross-checking Reggie against {@link java.util.regex.Pattern}. + * + *

The {@code @Test} method below runs a small deterministic sample on every {@code check}, so CI + * catches new regressions cheaply. Larger sweeps are gated behind the {@code reggie.fuzz.size} + * system property to keep day-to-day test runs fast. + * + *

Findings are printed but the test does not fail automatically — Reggie has known + * pre-existing divergences from JDK semantics, and a noisy assertion would obscure real + * regressions. The {@code maxFindings} guard exists to detect runaway regressions: if a new bug + * suddenly produces thousands of findings in the default 500-pattern sweep, the test fails. + */ +public class AlgorithmicFuzzTest { + + private static final long BASE_SEED = 0xC0DEFEED_DEADBEEFL; + + @Test + @Timeout(value = 300, unit = TimeUnit.SECONDS) + public void smokeFuzz_smallDeterministicSweep() { + FuzzRunner.Config cfg = new FuzzRunner.Config(); + cfg.seed = BASE_SEED; + cfg.patternCount = sizedPatternCount(2000); + cfg.inputsPerPattern = 8; + cfg.patternDepth = 3; + cfg.inputMaxLength = 12; + + FuzzRunner.Report report = new FuzzRunner().run(cfg); + System.out.println("[algorithmic-fuzz] " + report.summary()); + + // Shrink each finding and dedupe by (shrunk pattern, shrunk input, kind). Raw findings are + // often 30-char patterns reproducing the same underlying bug at different sizes; shrinking + // collapses them to a handful of unique minimal repros that can be triaged directly. + RegexFuzzShrinker shrinker = new RegexFuzzShrinker(); + Map uniqueShrunk = new LinkedHashMap<>(); + int shrunk = 0; + int shrinkLimit = Math.min(report.findings.size(), 80); // bound CPU on enormous reports + for (Finding f : report.findings) { + if (shrunk >= shrinkLimit) break; + Shrunk s = shrinker.shrink(f); + String key = s.findingKind + "||" + s.pattern + "||" + s.input; + uniqueShrunk.putIfAbsent(key, s); + shrunk++; + } + System.out.println( + "[algorithmic-fuzz] shrunk " + + shrunk + + " findings -> " + + uniqueShrunk.size() + + " unique minimal repros"); + + int printed = 0; + for (Shrunk s : uniqueShrunk.values()) { + if (printed >= 40) { + System.out.println( + "[algorithmic-fuzz] ... and " + (uniqueShrunk.size() - 40) + " more unique repros"); + break; + } + System.out.println( + "[algorithmic-fuzz-repro] " + + s.findingKind + + ": pattern=" + + s.pattern + + " input=" + + s.input); + printed++; + } + + // Backstop: if the divergence count blows up beyond a ceiling, fail. This is a + // regression-detection guard, not a quality target — tightened from 25% to 10% after + // Cat E/F anchor-condition-dilution fixes were confirmed. + int ceiling = (int) (cfg.patternCount * cfg.inputsPerPattern * 0.10); + assertTrue( + report.findings.size() < ceiling, + "Fuzz produced " + + report.findings.size() + + " findings (> ceiling " + + ceiling + + "). Look at the printed findings; this is likely a regression."); + } + + /** + * Allow CI / local invocations to scale the sweep up via {@code -Dreggie.fuzz.size=...}. The + * value is interpreted as a pattern count; a value of 0 keeps the default. + */ + private static int sizedPatternCount(int dflt) { + String prop = System.getProperty("reggie.fuzz.size"); + if (prop == null || prop.isEmpty()) return dflt; + try { + int v = Integer.parseInt(prop); + return v > 0 ? v : dflt; + } catch (NumberFormatException nfe) { + return dflt; + } + } +} diff --git a/reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/DollarAnchorCacheDiagTest.java b/reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/DollarAnchorCacheDiagTest.java new file mode 100644 index 0000000..dd1b9de --- /dev/null +++ b/reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/DollarAnchorCacheDiagTest.java @@ -0,0 +1,226 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.integration; + +import com.datadoghq.reggie.Reggie; +import com.datadoghq.reggie.integration.fuzz.RandomRegexGenerator; +import com.datadoghq.reggie.runtime.MatchResult; +import com.datadoghq.reggie.runtime.ReggieMatcher; +import com.datadoghq.reggie.runtime.RuntimeCompiler; +import java.util.Random; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import org.junit.jupiter.api.Test; + +/** + * Diagnoses whether $ anchor patterns fail due to structural-cache collisions when compiled after + * the same 2000-pattern fuzz sweep that the AlgorithmicFuzzTest uses. + */ +public class DollarAnchorCacheDiagTest { + + private static final long BASE_SEED = 0xC0DEFEED_DEADBEEFL; + + @Test + void remainingReprosDiagnosticAfterSweep() { + // Run the same 2000-pattern sweep as AlgorithmicFuzzTest, then check repro patterns. + RuntimeCompiler.clearCache(); + Random rng = new Random(BASE_SEED); + com.datadoghq.reggie.integration.fuzz.RandomRegexGenerator gen = + new com.datadoghq.reggie.integration.fuzz.RandomRegexGenerator(rng, 3); + for (int i = 0; i < 2000; i++) { + try { + Reggie.compile(gen.generate()); + } catch (Exception ignored) { + } + } + System.out.println("[sweep-diag] cache size after sweep: " + RuntimeCompiler.cacheSize()); + // Check if problematic patterns were generated during the sweep + for (String probe : + new String[] {"[^1]\\Z|-", "(.)\\1+", "(-)a|[^b](?:\\1[_-b][-]|1{2}\\1{0})"}) { + boolean inCache = RuntimeCompiler.cachedPatterns().contains(probe); + System.out.println("[sweep-diag] pattern-in-L1-cache: " + probe + " = " + inCache); + } + doRemainingReprosDiag("sweep"); + } + + @Test + void remainingReprosDiagnostic() { + RuntimeCompiler.clearCache(); + doRemainingReprosDiag("fresh"); + } + + private static void doRemainingReprosDiag(String tag) { + com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle oracle = + new com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle(); + String[][] cases = { + {"[^1]\\Z|-", ""}, + {"[^1]\\Z|-", "\n"}, + {"([0]?-*).(1{3}|-\\1)", "0-"}, + {"()(\\1|1{2}1{0})", ""}, + {"(-)a|[^b](?:\\1[_-b][-]|1{2}\\1{0})", ""}, + {"(.)\\1+", ""}, + {"(]{1})(1{0})|(\\1{2})[-]", "-"}, + {"(]{1})(1{0})|(\\1{2})[0]", "0"}, + {"_*0|(a{2}|-+){3,}Z", "0"}, + {"[--a]c?()|([^a]\\1)\\1+", "b0"}, + {"(){1}|.(\\1)", "1"}, + {"(){1}(\\1)", ""}, + {"(){1}|.(\\1)", "-"}, + }; + for (String[] tc : cases) { + com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle.Result r = oracle.check(tc[0], tc[1]); + String cls = "?"; + try { + cls = Reggie.compile(tc[0]).getClass().getSimpleName(); + } catch (Exception ignored) { + } + System.out.printf( + "[%s-diag] pat=%-35s inp=%s class=%-35s skipped=%s findings=%d%n", + tag, tc[0], "\"" + tc[1].replace("\n", "\\n") + "\"", cls, r.skipped, r.findings.size()); + for (var f : r.findings) System.out.println(" -> " + f.description); + } + } + + @Test + void dollarPatternsOracleCheck() { + // Directly call the oracle (which calls matches() + findMatch() in sequence) + // on the exact repro patterns after the fuzz sweep. + RuntimeCompiler.clearCache(); + Random rng = new Random(BASE_SEED); + RandomRegexGenerator gen = new RandomRegexGenerator(rng, 3); + for (int i = 0; i < 2000; i++) { + try { + Reggie.compile(gen.generate()); + } catch (Exception ignored) { + } + } + System.out.println("[oracle-diag] cache after sweep: " + RuntimeCompiler.cacheSize()); + + com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle oracle = + new com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle(); + String[][] cases = { + {"c$", "c"}, {".$", "b"}, {"[b]${1}", "b"}, {"$", "c"}, {"a?$", ""}, + {".{0}$", ""}, {"${1}", ""}, {"Z{1}|$", ""}, {"0|${1}", ""}, {"$", ""}, + }; + boolean anyFail = false; + for (String[] tc : cases) { + com.datadoghq.reggie.integration.fuzz.RegexFuzzOracle.Result r = oracle.check(tc[0], tc[1]); + boolean hasFinding = !r.skipped && !r.findings.isEmpty(); + System.out.printf( + "[oracle-diag] pat=%-20s inp=%-5s skipped=%s findings=%d%n", + tc[0], "\"" + tc[1] + "\"", r.skipped, r.findings.size()); + for (var f : r.findings) System.out.println(" -> " + f.description); + if (hasFinding) anyFail = true; + } + if (anyFail) { + throw new AssertionError("Oracle reports findings for $ patterns — see stdout"); + } + } + + @Test + void dollarPatternsWorkAfterFuzzSweep() { + // Start with a clean cache, then populate it the same way the fuzz test does. + RuntimeCompiler.clearCache(); + + Random rng = new Random(BASE_SEED); + RandomRegexGenerator gen = new RandomRegexGenerator(rng, 3); + for (int i = 0; i < 2000; i++) { + String pat = gen.generate(); + try { + Reggie.compile(pat); + } catch (Exception ignored) { + // Some patterns may be rejected; that's fine. + } + } + + System.out.println("[diag] cache size after sweep: " + RuntimeCompiler.cacheSize()); + + // Now test the failing $ patterns — calling matches() FIRST (like the oracle does), + // then findMatch(). If the matcher has mutable NFA state, matches() may corrupt it. + String[][] cases = { + {"c$", "c"}, + {".$", "b"}, + {"[b]${1}", "b"}, + {"$", "c"}, + {"a?$", ""}, + {".{0}$", ""}, + {"${1}", ""}, + {"Z{1}|$", ""}, + {"0|${1}", ""}, + {"$c?", ""}, + {"$", ""}, + {"a?$", ""}, + }; + + boolean anyFail = false; + for (String[] tc : cases) { + String pat = tc[0], inp = tc[1]; + Pattern jdk = Pattern.compile(pat); + Matcher jm = jdk.matcher(inp); + boolean jdkFound = jm.find(); + + ReggieMatcher rm = Reggie.compile(pat); + // Call matches() first, like the oracle does — this may corrupt NFA state + rm.matches(inp); + MatchResult r = rm.findMatch(inp); + boolean reggieFound = r != null; + + boolean ok = + (jdkFound == reggieFound) + && (!jdkFound || (jm.start() == r.start() && jm.end() == r.end())); + + System.out.printf( + "%s pat=%-20s inp=%-5s jdk=%s reggie=%s class=%s%n", + ok ? "OK " : "FAIL", + pat, + "\"" + inp + "\"", + jdkFound ? "[" + jm.start() + "," + jm.end() + ")" : "null", + reggieFound ? "[" + r.start() + "," + r.end() + ")" : "null", + rm.getClass().getSimpleName()); + + if (!ok) anyFail = true; + } + + if (anyFail) { + throw new AssertionError("$ anchor pattern(s) failed after fuzz sweep — see stdout above"); + } + } + + @Test + void backrefEmptyGroupDirectTest() { + RuntimeCompiler.clearCache(); + String[][] cases = { + {"(){1}(\\1)", ""}, + {"()(\\1|1{2}1{0})", ""}, + {"(){1}|.(\\1)", "1"}, + {"([0]?-*).(1{3}|-\\1)", "0-"}, + }; + for (String[] tc : cases) { + String pat = tc[0], inp = tc[1]; + Pattern jdk = Pattern.compile(pat); + ReggieMatcher rm = Reggie.compile(pat); + boolean jdkM = jdk.matcher(inp).matches(); + boolean reggieM = rm.matches(inp); + Matcher jm = jdk.matcher(inp); + boolean jdkF = jm.find(); + MatchResult rmr = rm.findMatch(inp); + boolean reggieF = rmr != null; + System.out.printf( + "[backref-diag] pat=%-30s inp=%-5s class=%-20s%n matches: jdk=%s reg=%s find: jdk=%s reg=%s%n", + pat, "\"" + inp + "\"", rm.getClass().getSimpleName(), jdkM, reggieM, jdkF, reggieF); + } + } +} diff --git a/reggie-processor/src/main/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGenerator.java b/reggie-processor/src/main/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGenerator.java index 507873c..8bc82db 100644 --- a/reggie-processor/src/main/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGenerator.java +++ b/reggie-processor/src/main/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGenerator.java @@ -32,6 +32,7 @@ import com.datadoghq.reggie.codegen.codegen.BackreferenceBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.BoundedQuantifierBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.DFASwitchBytecodeGenerator; +import com.datadoghq.reggie.codegen.codegen.DFATableBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.DFAUnrolledBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.FixedRepetitionBackrefBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.FixedSequenceBytecodeGenerator; @@ -118,8 +119,11 @@ public byte[] generate() throws Exception { strategy == PatternAnalyzer.MatchingStrategy.OPTIMIZED_NFA || strategy == PatternAnalyzer.MatchingStrategy.OPTIMIZED_NFA_WITH_BACKREFS || strategy == PatternAnalyzer.MatchingStrategy.OPTIMIZED_NFA_WITH_LOOKAROUND - || strategy == PatternAnalyzer.MatchingStrategy.HYBRID_DFA_LOOKAHEAD; - generateConstructor(cw, needsNFAState, nfa, nameMap); + || strategy == PatternAnalyzer.MatchingStrategy.HYBRID_DFA_LOOKAHEAD + || strategy == PatternAnalyzer.MatchingStrategy.SPECIALIZED_MULTIPLE_LOOKAHEADS + || strategy == PatternAnalyzer.MatchingStrategy.SPECIALIZED_LITERAL_LOOKAHEADS; + boolean needsRecursiveDescent = strategy == PatternAnalyzer.MatchingStrategy.RECURSIVE_DESCENT; + generateConstructor(cw, needsNFAState, needsRecursiveDescent, nfa, nameMap); // Generate methods based on strategy switch (strategy) { @@ -235,6 +239,7 @@ public byte[] generate() throws Exception { switchGen.generateFindFromMethod(cw, getJavaClassName()); switchGen.generateMatchesAtStartMethod(cw); switchGen.generateMatchMethod(cw, getJavaClassName()); + switchGen.generateMatchIntoMethod(cw, getJavaClassName()); switchGen.generateMatchesBoundedMethod(cw, getJavaClassName()); switchGen.generateMatchBoundedMethod(cw, getJavaClassName()); switchGen.generateFindMatchMethod(cw, getJavaClassName()); @@ -243,9 +248,19 @@ public byte[] generate() throws Exception { break; case DFA_TABLE: - // TODO: Implement table-driven DFA generator - throw new UnsupportedOperationException( - "DFA_TABLE bytecode generation not yet implemented."); + DFATableBytecodeGenerator tableGen = new DFATableBytecodeGenerator(dfa); + tableGen.generateStaticData(cw, getJavaClassName()); + tableGen.generateMatchesMethod(cw, getJavaClassName()); + tableGen.generateFindMethod(cw, getJavaClassName()); + tableGen.generateFindFromMethod(cw, getJavaClassName()); + tableGen.generateMatchMethod(cw, getJavaClassName()); + tableGen.generateMatchIntoMethod(cw, getJavaClassName()); + tableGen.generateMatchesBoundedMethod(cw, getJavaClassName()); + tableGen.generateMatchBoundedMethod(cw, getJavaClassName()); + tableGen.generateFindMatchMethod(cw, getJavaClassName()); + tableGen.generateFindMatchFromMethod(cw, getJavaClassName()); + tableGen.generateFindBoundsFromMethod(cw, getJavaClassName()); + break; case HYBRID_DFA_LOOKAHEAD: case SPECIALIZED_MULTIPLE_LOOKAHEADS: // Tier 3: Same as HYBRID but with 2+ lookaheads @@ -265,6 +280,7 @@ public byte[] generate() throws Exception { hybridGen.generateFindMethod(cw, getJavaClassName()); hybridGen.generateFindFromMethod(cw, getJavaClassName()); hybridGen.generateMatchMethod(cw, getJavaClassName()); + hybridGen.generateMatchIntoMethod(cw, getJavaClassName()); hybridGen.generateMatchBoundedMethod(cw, getJavaClassName()); hybridGen.generateFindMatchMethod(cw, getJavaClassName()); hybridGen.generateFindMatchFromMethod(cw, getJavaClassName()); @@ -286,6 +302,7 @@ public byte[] generate() throws Exception { plainNfaGen.generateFindMethod(cw, getJavaClassName()); plainNfaGen.generateFindFromMethod(cw, getJavaClassName()); plainNfaGen.generateMatchMethod(cw, getJavaClassName()); + plainNfaGen.generateMatchIntoMethod(cw, getJavaClassName()); plainNfaGen.generateMatchBoundedMethod(cw, getJavaClassName()); plainNfaGen.generateFindMatchMethod(cw, getJavaClassName()); plainNfaGen.generateFindMatchFromMethod(cw, getJavaClassName()); @@ -334,6 +351,7 @@ public byte[] generate() throws Exception { nfaGen.generateFindMethod(cw, getJavaClassName()); nfaGen.generateFindFromMethod(cw, getJavaClassName()); nfaGen.generateMatchMethod(cw, getJavaClassName()); + nfaGen.generateMatchIntoMethod(cw, getJavaClassName()); nfaGen.generateMatchBoundedMethod(cw, getJavaClassName()); nfaGen.generateMatchesBoundedMethod(cw, getJavaClassName()); nfaGen.generateMatchBoundedCharSequenceMethod(cw, getJavaClassName()); @@ -359,6 +377,7 @@ public byte[] generate() throws Exception { nfaGen.generateFindMethod(cw, getJavaClassName()); nfaGen.generateFindFromMethod(cw, getJavaClassName()); nfaGen.generateMatchMethod(cw, getJavaClassName()); + nfaGen.generateMatchIntoMethod(cw, getJavaClassName()); nfaGen.generateMatchBoundedMethod(cw, getJavaClassName()); nfaGen.generateFindMatchMethod(cw, getJavaClassName()); nfaGen.generateFindMatchFromMethod(cw, getJavaClassName()); @@ -399,6 +418,7 @@ public byte[] generate() throws Exception { literalGen.generateFindMethod(cw, getJavaClassName()); literalGen.generateFindFromMethod(cw, getJavaClassName()); literalGen.generateMatchMethod(cw, getJavaClassName()); + literalGen.generateMatchIntoMethod(cw, getJavaClassName()); literalGen.generateMatchBoundedMethod(cw, getJavaClassName()); literalGen.generateFindMatchMethod(cw, getJavaClassName()); literalGen.generateFindMatchFromMethod(cw, getJavaClassName()); @@ -428,6 +448,7 @@ public byte[] generate() throws Exception { // Generate public API methods (these call the parser methods) recursiveGen.generateMatchesMethod(cw, getJavaClassName()); + recursiveGen.generateMatchIntoMethod(cw, getJavaClassName()); recursiveGen.generateFindMethod(cw, getJavaClassName()); recursiveGen.generateFindFromMethod(cw, getJavaClassName()); recursiveGen.generateFindBoundsFromMethod(cw, getJavaClassName()); @@ -583,7 +604,11 @@ private boolean isCaseInsensitive(String pattern) { * for NFA-based strategies. */ private void generateConstructor( - ClassWriter cw, boolean needsNFAState, NFA nfa, Map nameMap) { + ClassWriter cw, + boolean needsNFAState, + boolean needsRecursiveDescent, + NFA nfa, + Map nameMap) { MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "", "()V", null, null); mv.visitCode(); @@ -614,6 +639,17 @@ private void generateConstructor( false); } + if (needsRecursiveDescent) { + mv.visitVarInsn(ALOAD, 0); // this + mv.visitLdcInsn(nfa != null ? nfa.getGroupCount() : countGroups(pattern)); // groupCount + mv.visitMethodInsn( + INVOKEVIRTUAL, + "com/datadoghq/reggie/runtime/ReggieMatcher", + "initRecursiveState", + "(I)V", + false); + } + if (!nameMap.isEmpty()) { // Build a HashMap with the named-group entries and call this.setNameToIndex(unmodifiableMap) mv.visitTypeInsn(NEW, "java/util/HashMap"); diff --git a/reggie-processor/src/test/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGeneratorTest.java b/reggie-processor/src/test/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGeneratorTest.java index 9acf31e..9899db9 100644 --- a/reggie-processor/src/test/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGeneratorTest.java +++ b/reggie-processor/src/test/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGeneratorTest.java @@ -179,14 +179,18 @@ void testVariableCaptureBackrefStrategy() throws Exception { @Test void testRecursiveDescentStrategy() throws Exception { - Object matcher = compile("\\d+?", "RecursiveMatcher"); + // \d+? (lazy) routes to JDK fallback: RECURSIVE_DESCENT lacks general alternation + // backtracking needed for lazy semantics. Use a quantified-backref pattern instead, + // which routes to RECURSIVE_DESCENT via hasQuantifiedBackrefs without lazy quantifiers. + Object matcher = compile("(\\d+)\\1{1,2}", "RecursiveMatcher"); Method matches = matcher.getClass().getMethod("matches", String.class); Method find = matcher.getClass().getMethod("find", String.class); - assertTrue((Boolean) matches.invoke(matcher, "1")); - assertTrue((Boolean) matches.invoke(matcher, "5")); - assertFalse((Boolean) matches.invoke(matcher, "a")); + assertTrue((Boolean) matches.invoke(matcher, "11")); // group="1", backref once + assertTrue((Boolean) matches.invoke(matcher, "1111")); // group="11", backref once + assertFalse((Boolean) matches.invoke(matcher, "12")); // backref mismatch + assertFalse((Boolean) matches.invoke(matcher, "1")); // no room for backref assertFalse((Boolean) matches.invoke(matcher, "")); - assertTrue((Boolean) find.invoke(matcher, "abc1def")); + assertTrue((Boolean) find.invoke(matcher, "x11y")); assertFalse((Boolean) find.invoke(matcher, "abc")); } @@ -229,11 +233,13 @@ void testFixedRepetitionBackrefStrategy() throws Exception { @Test void testOptionalGroupBackrefStrategy() throws Exception { - Object matcher = compile("^(a)?(b)?\\1\\2$", "OptGroupBackrefMatcher"); + // Use empty-alt form (a|) instead of (a)? so the group always participates + // (the quantified (X)? form now routes to JDK fallback for correct Java backref semantics) + Object matcher = compile("^(a|)(b|)\\1\\2$", "OptGroupBackrefMatcher"); Method matches = matcher.getClass().getMethod("matches", String.class); - // Group 1=(a)? and group 2=(b)? are optional; backrefs \1\2 match whatever they captured assertTrue((Boolean) matches.invoke(matcher, "abab")); assertTrue((Boolean) matches.invoke(matcher, "aa")); + assertTrue((Boolean) matches.invoke(matcher, "")); assertFalse((Boolean) matches.invoke(matcher, "abba")); assertFalse((Boolean) matches.invoke(matcher, "abc")); } diff --git a/reggie-processor/src/test/java/com/datadoghq/reggie/processor/parsing/RegexParserTest.java b/reggie-processor/src/test/java/com/datadoghq/reggie/processor/parsing/RegexParserTest.java index fd59a95..ff5dd26 100644 --- a/reggie-processor/src/test/java/com/datadoghq/reggie/processor/parsing/RegexParserTest.java +++ b/reggie-processor/src/test/java/com/datadoghq/reggie/processor/parsing/RegexParserTest.java @@ -245,4 +245,74 @@ void testDot() throws Exception { assertTrue(cc.chars.contains('0')); assertTrue(cc.chars.contains(' ')); } + + @Test + void testAtomicGroupParsesAsNonCapturingGroup() throws Exception { + RegexNode node = parser.parse("(?>abc)"); + assertTrue(node instanceof GroupNode); + GroupNode group = (GroupNode) node; + assertFalse(group.capturing); + assertEquals(0, group.groupNumber); + } + + @Test + void testAtomicGroupWithQuantifier() throws Exception { + RegexNode node = parser.parse("(?>a+)"); + assertTrue(node instanceof GroupNode); + GroupNode group = (GroupNode) node; + assertFalse(group.capturing); + assertTrue(group.child instanceof QuantifierNode); + } + + @Test + void testQuotedLiteral() throws Exception { + RegexNode node = parser.parse("\\Qabc\\E"); + assertTrue(node instanceof ConcatNode); + ConcatNode concat = (ConcatNode) node; + assertEquals(3, concat.children.size()); + assertEquals('a', ((LiteralNode) concat.children.get(0)).ch); + assertEquals('b', ((LiteralNode) concat.children.get(1)).ch); + assertEquals('c', ((LiteralNode) concat.children.get(2)).ch); + } + + @Test + void testQuotedLiteralTreatsMetacharactersAsLiterals() throws Exception { + RegexNode node = parser.parse("\\Q.\\E"); + assertTrue(node instanceof LiteralNode); + assertEquals('.', ((LiteralNode) node).ch); + } + + @Test + void testQuotedLiteralUnterminatedConsumesToEnd() throws Exception { + RegexNode node = parser.parse("\\Qabc"); + assertTrue(node instanceof ConcatNode); + ConcatNode concat = (ConcatNode) node; + assertEquals(3, concat.children.size()); + assertEquals('a', ((LiteralNode) concat.children.get(0)).ch); + assertEquals('b', ((LiteralNode) concat.children.get(1)).ch); + assertEquals('c', ((LiteralNode) concat.children.get(2)).ch); + } + + @Test + void testQuotedLiteralInsideCharacterClass() throws Exception { + RegexNode node = parser.parse("[a\\Q-]\\Eb]"); + assertTrue(node instanceof CharClassNode); + CharClassNode charClass = (CharClassNode) node; + assertTrue(charClass.chars.contains('a')); + assertTrue(charClass.chars.contains('-')); + assertTrue(charClass.chars.contains(']')); + assertTrue(charClass.chars.contains('b')); + assertFalse(charClass.chars.contains('c')); + } + + @Test + void testQuotedLiteralEmbeddedInPattern() throws Exception { + RegexNode node = parser.parse("foo\\Q/bar\\Ebaz"); + assertTrue(node instanceof ConcatNode); + ConcatNode concat = (ConcatNode) node; + assertEquals(7, concat.children.size()); + assertTrue(concat.children.get(3) instanceof ConcatNode); + ConcatNode quoted = (ConcatNode) concat.children.get(3); + assertEquals('/', ((LiteralNode) quoted.children.get(0)).ch); + } } diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/CapturePolicy.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/CapturePolicy.java new file mode 100644 index 0000000..b0d4748 --- /dev/null +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/CapturePolicy.java @@ -0,0 +1,28 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie; + +/** Controls which capturing groups Reggie should track and expose. */ +public enum CapturePolicy { + /** Track all capturing groups, matching java.util.regex group numbering semantics. */ + ALL, + + /** + * Track named groups and groups required by regex semantics (for example backreference targets). + * Unnamed groups that are only used for precedence are compiled as non-capturing groups. + */ + NAMED_ONLY +} diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/Reggie.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/Reggie.java index 26b2441..7235303 100644 --- a/reggie-runtime/src/main/java/com/datadoghq/reggie/Reggie.java +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/Reggie.java @@ -102,11 +102,25 @@ public static T patterns(Class patternClass) { * @param pattern the regex pattern string * @return compiled matcher instance * @throws java.util.regex.PatternSyntaxException if pattern is invalid + * @throws UnsupportedPatternException if pattern uses an unsupported regex construct */ public static ReggieMatcher compile(String pattern) { return RuntimeCompiler.compile(pattern); } + /** + * Compile a regex pattern at runtime with explicit options. + * + * @param pattern the regex pattern string + * @param options compilation options + * @return compiled matcher instance + * @throws java.util.regex.PatternSyntaxException if pattern is invalid + * @throws UnsupportedPatternException if pattern uses an unsupported regex construct + */ + public static ReggieMatcher compile(String pattern, ReggieOptions options) { + return RuntimeCompiler.compile(pattern, options); + } + /** * Compile a regex pattern with an explicit cache key. Useful for user-controlled caching when you * want the same compiled matcher for different pattern strings, or need to explicitly manage @@ -123,11 +137,17 @@ public static ReggieMatcher compile(String pattern) { * @param pattern the regex pattern string * @return compiled matcher instance * @throws java.util.regex.PatternSyntaxException if pattern is invalid + * @throws UnsupportedPatternException if pattern uses an unsupported regex construct */ public static ReggieMatcher cached(String key, String pattern) { return RuntimeCompiler.cached(key, pattern); } + /** Compile a regex pattern with an explicit cache key and options. */ + public static ReggieMatcher cached(String key, String pattern, ReggieOptions options) { + return RuntimeCompiler.cached(key, pattern, options); + } + /** * Clear the entire runtime pattern cache. This removes all cached compiled patterns, freeing * memory. Future calls to {@link #compile(String)} or {@link #cached(String, String)} will diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/ReggieOptions.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/ReggieOptions.java new file mode 100644 index 0000000..ca1e985 --- /dev/null +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/ReggieOptions.java @@ -0,0 +1,52 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie; + +import java.util.Objects; + +/** Options for runtime Reggie compilation. */ +public final class ReggieOptions { + public static final ReggieOptions DEFAULT = builder().build(); + + private final CapturePolicy capturePolicy; + + private ReggieOptions(Builder builder) { + this.capturePolicy = Objects.requireNonNull(builder.capturePolicy, "capturePolicy"); + } + + public CapturePolicy capturePolicy() { + return capturePolicy; + } + + public static Builder builder() { + return new Builder(); + } + + public static final class Builder { + private CapturePolicy capturePolicy = CapturePolicy.ALL; + + private Builder() {} + + public Builder capturePolicy(CapturePolicy capturePolicy) { + this.capturePolicy = Objects.requireNonNull(capturePolicy, "capturePolicy"); + return this; + } + + public ReggieOptions build() { + return new ReggieOptions(this); + } + } +} diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/UnsupportedPatternException.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/UnsupportedPatternException.java new file mode 100644 index 0000000..4a3530e --- /dev/null +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/UnsupportedPatternException.java @@ -0,0 +1,33 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie; + +/** + * Thrown when a pattern is syntactically valid regular-expression input but uses a construct that + * Reggie does not support. + * + *

This public exception type lets callers implement precise fallback logic without catching all + * exceptions from {@link Reggie#compile(String)}. + */ +public class UnsupportedPatternException extends RuntimeException { + public UnsupportedPatternException(String message) { + super(message); + } + + public UnsupportedPatternException(String message, Throwable cause) { + super(message, cause); + } +} diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/DFATableRuntime.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/DFATableRuntime.java new file mode 100644 index 0000000..3cb1fa9 --- /dev/null +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/DFATableRuntime.java @@ -0,0 +1,341 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +/** Runtime helpers used by generated table-driven DFA matchers. */ +public final class DFATableRuntime { + private DFATableRuntime() {} + + public static int[] decodeRleIntArray(String[] chunks, int length) { + int[] result = new int[length]; + int out = 0; + CharReader reader = new CharReader(chunks); + while (out < length && reader.hasNext()) { + int value = readInt(reader); + int count = readInt(reader); + for (int i = 0; i < count && out < length; i++) { + result[out++] = value; + } + } + if (out != length) { + throw new IllegalArgumentException("Invalid encoded DFA int table"); + } + return result; + } + + public static boolean[] decodeBooleanArray(String[] chunks, int length) { + boolean[] result = new boolean[length]; + CharReader reader = new CharReader(chunks); + for (int i = 0; i < length; i++) { + if (!reader.hasNext()) { + throw new IllegalArgumentException("Invalid encoded DFA boolean table"); + } + result[i] = reader.next() != 0; + } + return result; + } + + public static char[] decodeCharArray(String[] chunks, int length) { + char[] result = new char[length]; + CharReader reader = new CharReader(chunks); + for (int i = 0; i < length; i++) { + if (!reader.hasNext()) { + throw new IllegalArgumentException("Invalid encoded DFA char table"); + } + result[i] = reader.next(); + } + return result; + } + + public static boolean matches( + CharSequence input, + int startState, + int classCount, + int[] transitions, + boolean[] accepting, + int[] asciiClasses, + boolean[] startAscii, + char[] rangeStarts, + char[] rangeEnds, + int[] rangeClasses) { + if (input == null) { + return false; + } + int state = startState; + for (int pos = 0; pos < input.length(); pos++) { + state = + nextState( + state, + input.charAt(pos), + classCount, + transitions, + asciiClasses, + rangeStarts, + rangeEnds, + rangeClasses); + if (state < 0) { + return false; + } + } + return state < accepting.length && accepting[state]; + } + + public static boolean matchesBounded( + CharSequence input, + int start, + int end, + int startState, + int classCount, + int[] transitions, + boolean[] accepting, + int[] asciiClasses, + boolean[] startAscii, + char[] rangeStarts, + char[] rangeEnds, + int[] rangeClasses) { + if (input == null || start < 0 || end < start || end > input.length()) { + return false; + } + int state = startState; + for (int pos = start; pos < end; pos++) { + state = + nextState( + state, + input.charAt(pos), + classCount, + transitions, + asciiClasses, + rangeStarts, + rangeEnds, + rangeClasses); + if (state < 0) { + return false; + } + } + return state < accepting.length && accepting[state]; + } + + public static int findFrom( + CharSequence input, + int start, + int startState, + int classCount, + int[] transitions, + boolean[] accepting, + int[] asciiClasses, + boolean[] startAscii, + char[] rangeStarts, + char[] rangeEnds, + int[] rangeClasses) { + if (input == null) { + return -1; + } + int length = input.length(); + int from = Math.max(0, start); + if (from > length) { + return -1; + } + + if (startState < accepting.length && accepting[startState]) { + return from; + } + + for (int candidate = from; candidate < length; candidate++) { + char first = input.charAt(candidate); + if (first < 128 && !startAscii[first]) { + continue; + } + int state = + nextState( + startState, + first, + classCount, + transitions, + asciiClasses, + rangeStarts, + rangeEnds, + rangeClasses); + if (state < 0) { + continue; + } + if (state < accepting.length && accepting[state]) { + return candidate; + } + for (int pos = candidate + 1; pos < length; pos++) { + state = + nextState( + state, + input.charAt(pos), + classCount, + transitions, + asciiClasses, + rangeStarts, + rangeEnds, + rangeClasses); + if (state < 0) { + break; + } + if (state < accepting.length && accepting[state]) { + return candidate; + } + } + } + return -1; + } + + public static boolean findBoundsFrom( + CharSequence input, + int start, + int[] bounds, + int startState, + int classCount, + int[] transitions, + boolean[] accepting, + int[] asciiClasses, + boolean[] startAscii, + char[] rangeStarts, + char[] rangeEnds, + int[] rangeClasses) { + if (input == null || bounds == null || bounds.length < 2) { + return false; + } + int length = input.length(); + int from = Math.max(0, start); + if (from > length) { + return false; + } + + if (startState < accepting.length && accepting[startState]) { + bounds[0] = from; + bounds[1] = from; + return true; + } + + for (int candidate = from; candidate < length; candidate++) { + char first = input.charAt(candidate); + if (first < 128 && !startAscii[first]) { + continue; + } + int state = + nextState( + startState, + first, + classCount, + transitions, + asciiClasses, + rangeStarts, + rangeEnds, + rangeClasses); + if (state < 0) { + continue; + } + if (state < accepting.length && accepting[state]) { + bounds[0] = candidate; + bounds[1] = candidate + 1; + return true; + } + for (int pos = candidate + 1; pos < length; pos++) { + state = + nextState( + state, + input.charAt(pos), + classCount, + transitions, + asciiClasses, + rangeStarts, + rangeEnds, + rangeClasses); + if (state < 0) { + break; + } + if (state < accepting.length && accepting[state]) { + bounds[0] = candidate; + bounds[1] = pos + 1; + return true; + } + } + } + return false; + } + + private static int nextState( + int state, + char ch, + int classCount, + int[] transitions, + int[] asciiClasses, + char[] rangeStarts, + char[] rangeEnds, + int[] rangeClasses) { + if (state < 0) { + return -1; + } + int cls = charClass(ch, asciiClasses, rangeStarts, rangeEnds, rangeClasses); + int index = state * classCount + cls; + return index >= 0 && index < transitions.length ? transitions[index] : -1; + } + + private static int charClass( + char ch, int[] asciiClasses, char[] rangeStarts, char[] rangeEnds, int[] rangeClasses) { + if (ch < 128) { + return asciiClasses[ch]; + } + int low = 0; + int high = rangeStarts.length - 1; + while (low <= high) { + int mid = (low + high) >>> 1; + if (ch < rangeStarts[mid]) { + high = mid - 1; + } else if (ch > rangeEnds[mid]) { + low = mid + 1; + } else { + return rangeClasses[mid]; + } + } + return 0; + } + + private static int readInt(CharReader reader) { + int high = reader.next(); + int low = reader.next(); + return (high << 16) | low; + } + + private static final class CharReader { + private final String[] chunks; + private int chunkIndex; + private int charIndex; + + private CharReader(String[] chunks) { + this.chunks = chunks; + } + + private boolean hasNext() { + while (chunkIndex < chunks.length && charIndex >= chunks[chunkIndex].length()) { + chunkIndex++; + charIndex = 0; + } + return chunkIndex < chunks.length; + } + + private char next() { + if (!hasNext()) { + throw new IllegalArgumentException("Unexpected end of encoded DFA table"); + } + return chunks[chunkIndex].charAt(charIndex++); + } + } +} diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/HybridMatcher.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/HybridMatcher.java index 8cee562..d44b8db 100644 --- a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/HybridMatcher.java +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/HybridMatcher.java @@ -53,6 +53,14 @@ public MatchResult match(String input) { return enrich(nfaMatcher.match(input)); } + @Override + public boolean matchInto(String input, int[] groupStarts, int[] groupEnds) { + if (!dfaMatcher.matches(input)) { + return false; + } + return nfaMatcher.matchInto(input, groupStarts, groupEnds); + } + @Override public boolean matchesBounded(CharSequence input, int start, int end) { return dfaMatcher.matchesBounded(input, start, end); diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/JavaRegexFallbackMatcher.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/JavaRegexFallbackMatcher.java index 60df7ce..54dd04c 100644 --- a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/JavaRegexFallbackMatcher.java +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/JavaRegexFallbackMatcher.java @@ -83,6 +83,38 @@ public MatchResult findMatchFrom(String input, int start) { return m.find(start) ? toMatchResult(input, m) : null; } + @Override + public boolean matchInto(String input, int[] groupStarts, int[] groupEnds) { + java.util.regex.Matcher m = javaPattern.matcher(input); + if (!m.matches()) { + return false; + } + copyGroups(m, groupStarts, groupEnds); + return true; + } + + @Override + public boolean findMatchInto(String input, int start, int[] groupStarts, int[] groupEnds) { + java.util.regex.Matcher m = javaPattern.matcher(input); + if (!m.find(start)) { + return false; + } + copyGroups(m, groupStarts, groupEnds); + return true; + } + + private void copyGroups(java.util.regex.Matcher m, int[] groupStarts, int[] groupEnds) { + int gc = m.groupCount(); + if (groupStarts.length <= gc || groupEnds.length <= gc) { + throw new IndexOutOfBoundsException( + "group arrays must have length at least " + (gc + 1) + " for this pattern"); + } + for (int i = 0; i <= gc; i++) { + groupStarts[i] = m.start(i); + groupEnds[i] = m.end(i); + } + } + private MatchResult toMatchResult(String input, java.util.regex.Matcher m) { int gc = m.groupCount(); int[] starts = new int[gc + 1]; diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/LinearTokenSequenceMatcher.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/LinearTokenSequenceMatcher.java new file mode 100644 index 0000000..c4e922b --- /dev/null +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/LinearTokenSequenceMatcher.java @@ -0,0 +1,462 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import com.datadoghq.reggie.codegen.analysis.LinearTokenSequencePlan; +import java.util.Arrays; +import java.util.Map; +import java.util.Objects; + +/** Generic runtime executor for deterministic linear-token-sequence plans. */ +final class LinearTokenSequenceMatcher extends ReggieMatcher { + private final LinearTokenSequencePlan plan; + private final int groupCount; + private final int[] scratchStarts; + private final int[] scratchEnds; + private final int[][] optionalScratchStarts; + private final int[][] optionalScratchEnds; + + LinearTokenSequenceMatcher( + String pattern, + LinearTokenSequencePlan plan, + int groupCount, + Map nameToIndex) { + super(pattern); + this.plan = plan; + this.groupCount = groupCount; + this.nameToIndex = Map.copyOf(nameToIndex); + this.scratchStarts = new int[groupCount + 1]; + this.scratchEnds = new int[groupCount + 1]; + int optionalDepth = maxOptionalDepth(plan.ops()); + this.optionalScratchStarts = new int[optionalDepth][groupCount + 1]; + this.optionalScratchEnds = new int[optionalDepth][groupCount + 1]; + } + + @Override + public boolean matches(String input) { + return matchInto(input, scratchStarts, scratchEnds); + } + + @Override + public boolean find(String input) { + return findFrom(input, 0) >= 0; + } + + @Override + public int findFrom(String input, int start) { + Objects.requireNonNull(input, "input"); + if (start < 0 || start > input.length()) return -1; + for (int pos = start; pos <= input.length(); pos++) { + if (matchesAt(input, pos, scratchStarts, scratchEnds, false)) return pos; + } + return -1; + } + + @Override + public MatchResult match(String input) { + int[] starts = new int[groupCount + 1]; + int[] ends = new int[groupCount + 1]; + return matchInto(input, starts, ends) + ? new MatchResultImpl(input, starts, ends, groupCount, nameToIndex) + : null; + } + + @Override + public boolean matchesBounded(CharSequence input, int start, int end) { + Objects.requireNonNull(input, "input"); + return start >= 0 + && end >= start + && end <= input.length() + && matches(input.subSequence(start, end).toString()); + } + + @Override + public MatchResult matchBounded(CharSequence input, int start, int end) { + Objects.requireNonNull(input, "input"); + if (start < 0 || end < start || end > input.length()) return null; + return match(input.subSequence(start, end).toString()); + } + + @Override + public MatchResult findMatch(String input) { + return findMatchFrom(input, 0); + } + + @Override + public MatchResult findMatchFrom(String input, int start) { + int pos = findFrom(input, start); + if (pos < 0) return null; + int[] starts = new int[groupCount + 1]; + int[] ends = new int[groupCount + 1]; + return matchesAt(input, pos, starts, ends, false) + ? new MatchResultImpl(input, starts, ends, groupCount, nameToIndex) + : null; + } + + @Override + public boolean matchInto(String input, int[] groupStarts, int[] groupEnds) { + Objects.requireNonNull(input, "input"); + Objects.requireNonNull(groupStarts, "groupStarts"); + Objects.requireNonNull(groupEnds, "groupEnds"); + if (groupStarts.length <= groupCount || groupEnds.length <= groupCount) { + throw new IndexOutOfBoundsException("group arrays too small for " + groupCount + " groups"); + } + if (!matchesAt(input, 0, scratchStarts, scratchEnds, true)) return false; + System.arraycopy(scratchStarts, 0, groupStarts, 0, groupCount + 1); + System.arraycopy(scratchEnds, 0, groupEnds, 0, groupCount + 1); + return true; + } + + private boolean matchesAt(String input, int offset, int[] starts, int[] ends, boolean fullMatch) { + Arrays.fill(starts, -1); + Arrays.fill(ends, -1); + starts[0] = offset; + int pos = offset; + for (int i = 0; i < plan.ops().size(); i++) { + LinearTokenSequencePlan.Op op = plan.ops().get(i); + if (isTargetBeforeOptionalHttpVersion(plan.ops(), i)) { + pos = + captureTargetBeforeOptionalHttpVersion( + op, plan.ops().get(i + 1), input, pos, starts, ends); + if (pos < 0) return false; + i++; + continue; + } + pos = apply(op, input, pos, starts, ends, i == plan.ops().size() - 1, 0); + if (pos < 0) return false; + } + if (fullMatch && pos != input.length()) return false; + ends[0] = fullMatch ? input.length() : pos; + return true; + } + + private int apply( + LinearTokenSequencePlan.Op op, + String input, + int pos, + int[] starts, + int[] ends, + boolean lastOp, + int optionalDepth) { + return switch (op.kind()) { + case LITERAL -> startsWith(input, pos, op.literal()) ? pos + op.literal().length() : -1; + case WHITESPACE_PLUS -> skipWhitespace(input, pos); + case CAPTURE_NON_SPACE -> captureNonSpace(input, pos, op.groupNumber(), starts, ends); + case CAPTURE_DIGITS -> captureDigits(input, pos, op.groupNumber(), starts, ends); + case CAPTURE_SIGNED_INTEGER -> + captureSignedInteger(input, pos, op.groupNumber(), starts, ends); + case CAPTURE_SIGNED_INTEGER_OR_DASH -> + captureSignedIntegerOrDash(input, pos, op.groupNumber(), starts, ends, true); + case CAPTURE_SIGNED_INTEGER_OR_UNCAPTURED_DASH -> + captureSignedIntegerOrDash(input, pos, op.groupNumber(), starts, ends, false); + case CAPTURE_DECIMAL_NUMBER -> + captureDecimal(input, pos, op.groupNumber(), starts, ends, false); + case CAPTURE_SIGNED_DECIMAL_NUMBER -> + captureDecimal(input, pos, op.groupNumber(), starts, ends, true); + case CAPTURE_WORD -> captureWord(input, pos, op.groupNumber(), starts, ends); + case CAPTURE_UNTIL_DELIMITER -> + captureUntil(input, pos, op.delimiter(), op.groupNumber(), starts, ends); + case CAPTURE_QUOTED_UNTIL_DELIMITER -> + captureQuotedUntil(input, pos, op.delimiter(), op.groupNumber(), starts, ends, false); + case CAPTURE_QUOTED_NON_SPACE -> + captureQuotedUntil(input, pos, op.delimiter(), op.groupNumber(), starts, ends, true); + case CAPTURE_IP_OR_HOST -> captureIpOrHost(input, pos, op.groupNumber(), starts, ends); + case CAPTURE_BRACKETED_WORD_AFTER_SKIP -> + captureBracketedWordAfterSkip(input, pos, op.groupNumber(), starts, ends); + case SKIP_ANY -> lastOp ? input.length() : -1; + case ANCHOR -> pos; + case OPTIONAL_SEQUENCE -> applyOptional(op, input, pos, starts, ends, optionalDepth); + }; + } + + private static int captureNonSpace(String input, int pos, int group, int[] starts, int[] ends) { + int start = pos; + while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) pos++; + if (pos == start) return -1; + set(starts, ends, group, start, pos); + return pos; + } + + private static int captureDigits(String input, int pos, int group, int[] starts, int[] ends) { + int start = pos; + while (pos < input.length() && isDigit(input.charAt(pos))) pos++; + if (pos == start) return -1; + set(starts, ends, group, start, pos); + return pos; + } + + private static int captureSignedInteger( + String input, int pos, int group, int[] starts, int[] ends) { + int start = pos; + if (pos < input.length() && (input.charAt(pos) == '+' || input.charAt(pos) == '-')) pos++; + int digitStart = pos; + while (pos < input.length() && isDigit(input.charAt(pos))) pos++; + if (pos == digitStart) return -1; + set(starts, ends, group, start, pos); + return pos; + } + + private static int captureSignedIntegerOrDash( + String input, int pos, int group, int[] starts, int[] ends, boolean captureDash) { + if (pos < input.length() && input.charAt(pos) == '-') { + if (captureDash) set(starts, ends, group, pos, pos + 1); + return pos + 1; + } + return captureSignedInteger(input, pos, group, starts, ends); + } + + private static int captureDecimal( + String input, int pos, int group, int[] starts, int[] ends, boolean signed) { + int start = pos; + if (signed && pos < input.length() && (input.charAt(pos) == '+' || input.charAt(pos) == '-')) { + pos++; + } + int digitStart = pos; + while (pos < input.length() && isDigit(input.charAt(pos))) pos++; + boolean sawLeadingDigits = pos > digitStart; + if (pos < input.length() && input.charAt(pos) == '.') { + pos++; + int fractionStart = pos; + while (pos < input.length() && isDigit(input.charAt(pos))) pos++; + if (!sawLeadingDigits && pos == fractionStart) return -1; + } else if (!sawLeadingDigits) { + return -1; + } + set(starts, ends, group, start, pos); + return pos; + } + + private static int captureWord(String input, int pos, int group, int[] starts, int[] ends) { + int start = pos; + while (pos < input.length() && isWord(input.charAt(pos))) pos++; + if (pos == start) return -1; + set(starts, ends, group, start, pos); + return pos; + } + + private static int captureUntil( + String input, int pos, char delimiter, int group, int[] starts, int[] ends) { + int end = input.indexOf(delimiter, pos); + if (end < 0) return -1; + set(starts, ends, group, pos, end); + return end; + } + + private static int captureQuotedUntil( + String input, + int pos, + char delimiter, + int group, + int[] starts, + int[] ends, + boolean nonSpace) { + if (pos >= input.length() || input.charAt(pos) != '"') return -1; + int start = pos + 1; + int end = input.indexOf(delimiter, start); + if (end < 0) return -1; + if (nonSpace) { + for (int i = start; i < end; i++) { + if (Character.isWhitespace(input.charAt(i))) return -1; + } + } + set(starts, ends, group, start, end); + return end + 1; + } + + private static int captureIpOrHost(String input, int pos, int group, int[] starts, int[] ends) { + int end = captureNonSpace(input, pos, group, starts, ends); + return end >= 0 && isIpOrHost(input, pos, end) ? end : -1; + } + + private static int captureBracketedWordAfterSkip( + String input, int pos, int group, int[] starts, int[] ends) { + int search = pos; + int lastStart = -1; + int lastEnd = -1; + while (search < input.length()) { + int open = input.indexOf('[', search); + if (open < 0) break; + int close = input.indexOf(']', open + 1); + if (close < 0) break; + int wordEnd = open + 1; + while (wordEnd < close && isWord(input.charAt(wordEnd))) wordEnd++; + if (wordEnd == close + && wordEnd > open + 1 + && close + 1 < input.length() + && Character.isWhitespace(input.charAt(close + 1))) { + lastStart = open + 1; + lastEnd = close; + } + search = open + 1; + } + if (lastStart < 0) return -1; + set(starts, ends, group, lastStart, lastEnd); + return input.length(); + } + + private static boolean isTargetBeforeOptionalHttpVersion( + java.util.List ops, int index) { + if (index + 2 >= ops.size()) return false; + LinearTokenSequencePlan.Op target = ops.get(index); + LinearTokenSequencePlan.Op optional = ops.get(index + 1); + LinearTokenSequencePlan.Op quote = ops.get(index + 2); + if (target.kind() != LinearTokenSequencePlan.OpKind.CAPTURE_NON_SPACE + || optional.kind() != LinearTokenSequencePlan.OpKind.OPTIONAL_SEQUENCE + || quote.kind() != LinearTokenSequencePlan.OpKind.LITERAL + || !quote.literal().startsWith("\"")) return false; + if (optional.children().size() != 2) return false; + LinearTokenSequencePlan.Op prefix = optional.children().get(0); + LinearTokenSequencePlan.Op version = optional.children().get(1); + return prefix.kind() == LinearTokenSequencePlan.OpKind.LITERAL + && " HTTP/".equals(prefix.literal()) + && version.kind() == LinearTokenSequencePlan.OpKind.CAPTURE_DECIMAL_NUMBER; + } + + private static int captureTargetBeforeOptionalHttpVersion( + LinearTokenSequencePlan.Op target, + LinearTokenSequencePlan.Op optionalHttpVersion, + String input, + int pos, + int[] starts, + int[] ends) { + int quote = input.indexOf('"', pos); + if (quote < 0 || quote == pos) return -1; + int marker = input.lastIndexOf(" HTTP/", quote); + if (marker >= pos && isNonSpace(input, pos, marker)) { + LinearTokenSequencePlan.Op version = optionalHttpVersion.children().get(1); + int versionStart = marker + " HTTP/".length(); + int versionEnd = scanDecimal(input, versionStart, quote, false); + if (versionEnd == quote) { + set(starts, ends, target.groupNumber(), pos, marker); + set(starts, ends, version.groupNumber(), versionStart, quote); + return quote; + } + } + if (!isNonSpace(input, pos, quote)) return -1; + set(starts, ends, target.groupNumber(), pos, quote); + return quote; + } + + private int applyOptional( + LinearTokenSequencePlan.Op op, + String input, + int pos, + int[] starts, + int[] ends, + int optionalDepth) { + int[] savedStarts = optionalScratchStarts[optionalDepth]; + int[] savedEnds = optionalScratchEnds[optionalDepth]; + System.arraycopy(starts, 0, savedStarts, 0, starts.length); + System.arraycopy(ends, 0, savedEnds, 0, ends.length); + int next = pos; + for (int i = 0; i < op.children().size(); i++) { + next = + apply( + op.children().get(i), + input, + next, + starts, + ends, + i == op.children().size() - 1, + optionalDepth + 1); + if (next < 0) { + System.arraycopy(savedStarts, 0, starts, 0, starts.length); + System.arraycopy(savedEnds, 0, ends, 0, ends.length); + return pos; + } + } + return next; + } + + private static int maxOptionalDepth(Iterable ops) { + int max = 0; + for (LinearTokenSequencePlan.Op op : ops) { + int childDepth = maxOptionalDepth(op.children()); + max = + Math.max( + max, + op.kind() == LinearTokenSequencePlan.OpKind.OPTIONAL_SEQUENCE + ? 1 + childDepth + : childDepth); + } + return max; + } + + private static int skipWhitespace(String input, int pos) { + int start = pos; + while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++; + return pos == start ? -1 : pos; + } + + private static boolean startsWith(String input, int pos, String prefix) { + return pos >= 0 && pos + prefix.length() <= input.length() && input.startsWith(prefix, pos); + } + + private static void set(int[] starts, int[] ends, int group, int start, int end) { + if (group > 0) { + starts[group] = start; + ends[group] = end; + } + } + + private static boolean isIpOrHost(String input, int start, int end) { + for (int i = start; i < end; i++) { + char ch = input.charAt(i); + if (!isAsciiAlphaNum(ch) && ch != '-' && ch != '_' && ch != '.' && ch != ':' && ch != '%') { + return false; + } + } + return end > start; + } + + private static boolean isNonSpace(String input, int start, int end) { + if (end <= start) return false; + for (int i = start; i < end; i++) { + if (Character.isWhitespace(input.charAt(i))) return false; + } + return true; + } + + private static int scanDecimal(String input, int pos, int limit, boolean signed) { + if (signed && pos < limit && (input.charAt(pos) == '+' || input.charAt(pos) == '-')) { + pos++; + } + int digitStart = pos; + while (pos < limit && isDigit(input.charAt(pos))) pos++; + boolean sawLeadingDigits = pos > digitStart; + if (pos < limit && input.charAt(pos) == '.') { + pos++; + int fractionStart = pos; + while (pos < limit && isDigit(input.charAt(pos))) pos++; + if (!sawLeadingDigits && pos == fractionStart) return -1; + } else if (!sawLeadingDigits) { + return -1; + } + return pos; + } + + private static boolean isDigit(char ch) { + return ch >= '0' && ch <= '9'; + } + + private static boolean isWord(char ch) { + return isAsciiAlphaNum(ch) || ch == '_'; + } + + private static boolean isAsciiAlphaNum(char ch) { + return isDigit(ch) || (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'); + } +} diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/ReggieMatcher.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/ReggieMatcher.java index 99f0447..b28e772 100644 --- a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/ReggieMatcher.java +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/ReggieMatcher.java @@ -53,6 +53,10 @@ protected void setNameToIndex(Map map) { protected StateSet epsilonProcessed; protected int[] groupStarts; protected int[] groupEnds; + protected int[][] configGroupStarts; + protected int[][] configGroupEnds; + protected int[] parentStateMap; + protected int[] recursiveGroups; // Stateful iterators for efficient BitSet iteration // Pre-allocated to avoid allocation overhead in hot path @@ -82,13 +86,23 @@ protected void initNFAState(int stateCount, int groupCount) { this.nextStatesIter = this.nextStates.iterator(); this.epsilonProcessedIter = this.epsilonProcessed.iterator(); + this.groupStarts = new int[groupCount + 1]; + this.groupEnds = new int[groupCount + 1]; if (groupCount > 0) { - this.groupStarts = new int[groupCount + 1]; - this.groupEnds = new int[groupCount + 1]; + this.configGroupStarts = new int[stateCount][groupCount + 1]; + this.configGroupEnds = new int[stateCount][groupCount + 1]; + this.parentStateMap = new int[stateCount]; } } } + /** Initialize reusable state for recursive-descent matchers. */ + protected void initRecursiveState(int groupCount) { + if (this.recursiveGroups == null) { + this.recursiveGroups = new int[(groupCount + 1) * 2]; + } + } + /** * Tests whether the entire input string matches the pattern. This method is allocation-free. * @@ -164,6 +178,88 @@ protected void initNFAState(int stateCount, int groupCount) { */ public abstract MatchResult findMatchFrom(String input, int start); + /** + * Tests whether the entire input string matches and stores group boundaries in caller-provided + * arrays. Group 0 is the entire match; groups 1..n are capturing groups. Unmatched optional + * groups are stored as {@code -1} in both arrays. + * + *

The default implementation delegates to {@link #match(String)} and therefore may allocate. + * Bytecode generators can override this method to populate the arrays directly. + * + * @param input the string to match + * @param groupStarts caller-provided array for group start offsets + * @param groupEnds caller-provided array for group end offsets + * @return true if the entire input matches; false otherwise, leaving arrays unchanged + * @throws NullPointerException if any argument is null + * @throws IndexOutOfBoundsException if the arrays are too small for the matched pattern's groups + */ + public boolean matchInto(String input, int[] groupStarts, int[] groupEnds) { + Objects.requireNonNull(input, "input"); + Objects.requireNonNull(groupStarts, "groupStarts"); + Objects.requireNonNull(groupEnds, "groupEnds"); + + MatchResult match = match(input); + if (match == null) { + return false; + } + copyGroups(match, groupStarts, groupEnds); + return true; + } + + /** + * Finds the first match and stores group boundaries in caller-provided arrays. Equivalent to + * {@link #findMatchInto(String, int, int[], int[])} with {@code start == 0}. + * + * @param input the string to search + * @param groupStarts caller-provided array for group start offsets + * @param groupEnds caller-provided array for group end offsets + * @return true if a match is found; false otherwise, leaving arrays unchanged + */ + public boolean findMatchInto(String input, int[] groupStarts, int[] groupEnds) { + return findMatchInto(input, 0, groupStarts, groupEnds); + } + + /** + * Finds a match starting at the given offset and stores group boundaries in caller-provided + * arrays. Group 0 is the entire match; groups 1..n are capturing groups. Unmatched optional + * groups are stored as {@code -1} in both arrays. + * + *

The default implementation delegates to {@link #findMatchFrom(String, int)} and therefore + * may allocate. Bytecode generators can override this method to populate the arrays directly. + * + * @param input the string to search + * @param start the starting offset + * @param groupStarts caller-provided array for group start offsets + * @param groupEnds caller-provided array for group end offsets + * @return true if a match is found; false otherwise, leaving arrays unchanged + * @throws NullPointerException if any array/input argument is null + * @throws IndexOutOfBoundsException if the arrays are too small for the matched pattern's groups + */ + public boolean findMatchInto(String input, int start, int[] groupStarts, int[] groupEnds) { + Objects.requireNonNull(input, "input"); + Objects.requireNonNull(groupStarts, "groupStarts"); + Objects.requireNonNull(groupEnds, "groupEnds"); + + MatchResult match = findMatchFrom(input, start); + if (match == null) { + return false; + } + copyGroups(match, groupStarts, groupEnds); + return true; + } + + protected static void copyGroups(MatchResult match, int[] groupStarts, int[] groupEnds) { + int groups = match.groupCount(); + if (groupStarts.length <= groups || groupEnds.length <= groups) { + throw new IndexOutOfBoundsException( + "group arrays must have length at least " + (groups + 1) + " for this pattern"); + } + for (int i = 0; i <= groups; i++) { + groupStarts[i] = match.start(i); + groupEnds[i] = match.end(i); + } + } + /** * Finds the pattern starting at the given offset and stores match boundaries in the provided * array. This is an allocation-free alternative to findMatchFrom() for operations that only need diff --git a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/RuntimeCompiler.java b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/RuntimeCompiler.java index 3e4a587..c714450 100644 --- a/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/RuntimeCompiler.java +++ b/reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/RuntimeCompiler.java @@ -17,16 +17,21 @@ import static org.objectweb.asm.Opcodes.*; +import com.datadoghq.reggie.CapturePolicy; +import com.datadoghq.reggie.ReggieOptions; import com.datadoghq.reggie.codegen.analysis.BackreferencePatternInfo; +import com.datadoghq.reggie.codegen.analysis.CaptureProjection; import com.datadoghq.reggie.codegen.analysis.ConcatGreedyGroupInfo; import com.datadoghq.reggie.codegen.analysis.ConcatQuantifiedGroupsInfo; import com.datadoghq.reggie.codegen.analysis.FallbackPatternDetector; import com.datadoghq.reggie.codegen.analysis.FixedRepetitionBackrefInfo; import com.datadoghq.reggie.codegen.analysis.GreedyBacktrackInfo; import com.datadoghq.reggie.codegen.analysis.LinearPatternInfo; +import com.datadoghq.reggie.codegen.analysis.LinearTokenSequencePlan; import com.datadoghq.reggie.codegen.analysis.NestedQuantifiedGroupsInfo; import com.datadoghq.reggie.codegen.analysis.OptionalGroupBackrefInfo; import com.datadoghq.reggie.codegen.analysis.PatternAnalyzer; +import com.datadoghq.reggie.codegen.analysis.PatternCategorizer; import com.datadoghq.reggie.codegen.analysis.QuantifiedGroupInfo; import com.datadoghq.reggie.codegen.analysis.StructuralHash; import com.datadoghq.reggie.codegen.analysis.VariableCaptureBackrefInfo; @@ -38,6 +43,7 @@ import com.datadoghq.reggie.codegen.codegen.ConcatGreedyGroupBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.ConcatQuantifiedGroupsBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.DFASwitchBytecodeGenerator; +import com.datadoghq.reggie.codegen.codegen.DFATableBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.DFAUnrolledBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.FixedRepetitionBackrefBytecodeGenerator; import com.datadoghq.reggie.codegen.codegen.FixedSequenceBytecodeGenerator; @@ -78,8 +84,10 @@ public class RuntimeCompiler { private static final ConcurrentHashMap PATTERN_CACHE = new ConcurrentHashMap<>(); - // Level 2: Structural hash → generated class (deduplication for similar patterns) - private static final ConcurrentHashMap> STRUCTURE_CACHE = + // Level 2: Structural hash → generated class (deduplication for similar patterns). + // Key is Long (64-bit) to make birthday collisions essentially impossible across large pattern + // sets; an int key was observed to cause structural-cache false-hits with wrong match semantics. + private static final ConcurrentHashMap> STRUCTURE_CACHE = new ConcurrentHashMap<>(); // Lookup for hidden class definition (Java 21+) @@ -94,7 +102,16 @@ public class RuntimeCompiler { * @throws PatternSyntaxException if pattern is invalid */ public static ReggieMatcher compile(String pattern) { - return PATTERN_CACHE.computeIfAbsent(pattern, RuntimeCompiler::compileInternal); + return compile(pattern, ReggieOptions.DEFAULT); + } + + /** Compile pattern with runtime compilation options. */ + public static ReggieMatcher compile(String pattern, ReggieOptions options) { + if (options.capturePolicy() == CapturePolicy.ALL) { + return PATTERN_CACHE.computeIfAbsent(pattern, RuntimeCompiler::compileInternal); + } + String cacheKey = pattern + "\u0000capturePolicy=" + options.capturePolicy(); + return PATTERN_CACHE.computeIfAbsent(cacheKey, k -> compileInternal(pattern, options)); } /** @@ -109,6 +126,11 @@ public static ReggieMatcher cached(String key, String pattern) { return PATTERN_CACHE.computeIfAbsent(key, k -> compileInternal(pattern)); } + /** Compile with explicit cache key and runtime compilation options. */ + public static ReggieMatcher cached(String key, String pattern, ReggieOptions options) { + return PATTERN_CACHE.computeIfAbsent(key, k -> compileInternal(pattern, options)); + } + /** Clear both pattern and structure caches. */ public static void clearCache() { PATTERN_CACHE.clear(); @@ -139,11 +161,23 @@ public static Set cachedPatterns() { * cache (level 2) checked here */ private static ReggieMatcher compileInternal(String pattern) { + return compileInternal(pattern, ReggieOptions.DEFAULT); + } + + private static ReggieMatcher compileInternal(String pattern, ReggieOptions options) { try { // 1. Parse pattern to AST RegexParser parser = new RegexParser(); RegexNode ast = parser.parse(pattern); Map nameMap = parser.getGroupNameMap(); + if (options.capturePolicy() == CapturePolicy.NAMED_ONLY) { + ast = CaptureProjection.preserveNamedAndSemanticCaptures(ast); + ReggieMatcher linearTokenSequenceMatcher = + tryCompileLinearTokenSequence(pattern, ast, nameMap); + if (linearTokenSequenceMatcher != null) { + return linearTokenSequenceMatcher; + } + } // 2. Check if pattern requires recursive descent (context-free features) // Do this early to avoid unnecessary NFA building @@ -165,6 +199,24 @@ private static ReggieMatcher compileInternal(String pattern) { PatternAnalyzer.MatchingStrategyResult result = analyzer.analyzeAndRecommend(); // 3.5. Fall back to java.util.regex for patterns with known engine bugs + if (result.anchorConditionDiluted) { + ReggieMatcher fallback = + new JavaRegexFallbackMatcher(pattern, "anchor condition diluted in DFA construction"); + if (!nameMap.isEmpty()) { + fallback.setNameToIndex(nameMap); + } + return fallback; + } + if (result.alternationPriorityConflict) { + ReggieMatcher fallback = + new JavaRegexFallbackMatcher( + pattern, + "alternation priority conflict: DFA longest-match vs NFA first-alternative"); + if (!nameMap.isEmpty()) { + fallback.setNameToIndex(nameMap); + } + return fallback; + } String fallbackReason = FallbackPatternDetector.needsFallback(ast, result.strategy); if (fallbackReason != null) { ReggieMatcher fallback = new JavaRegexFallbackMatcher(pattern, fallbackReason); @@ -181,8 +233,8 @@ private static ReggieMatcher compileInternal(String pattern) { return hybrid; } - // 5. Compute structural hash for level 2 cache lookup - int structHash = + // 5. Compute structural hash for level 2 cache lookup (64-bit key) + long structHash = (nfa != null) ? StructuralHash.compute(result, nfa, caseInsensitive) : StructuralHash.computeWithoutGroupCount(result); @@ -220,6 +272,25 @@ private static ReggieMatcher compileInternal(String pattern) { return matcher; + } catch (org.objectweb.asm.MethodTooLargeException e) { + // Very large grok-style alternations can exceed JVM method-size limits. Preserve drop-in + // behavior by falling back to java.util.regex instead of failing compilation, but include the + // generated method and bytecode size in the warning so routing/generator fixes can be guided + // by real-world patterns. + ReggieMatcher fallback = + new JavaRegexFallbackMatcher( + pattern, + "generated method too large: " + + e.getClassName() + + "." + + e.getMethodName() + + e.getDescriptor() + + " codeSize=" + + e.getCodeSize()); + return fallback; + } catch (RegexParser.UnsupportedPatternException | UnsupportedOperationException e) { + throw new com.datadoghq.reggie.UnsupportedPatternException( + "Unsupported regex pattern: " + pattern + ": " + e.getMessage(), e); } catch (PatternSyntaxException e) { // Re-throw PatternSyntaxException as-is throw e; @@ -229,6 +300,44 @@ private static ReggieMatcher compileInternal(String pattern) { } } + private static ReggieMatcher tryCompileLinearTokenSequence( + String pattern, RegexNode ast, Map nameMap) { + return LinearTokenSequencePlan.from(PatternCategorizer.categorize(ast)) + .filter(RuntimeCompiler::isRuntimeExecutableLinearTokenSequence) + .map(plan -> new LinearTokenSequenceMatcher(pattern, plan, countGroups(pattern), nameMap)) + .map(NameEnrichingMatcher::new) + .orElse(null); + } + + private static boolean isRuntimeExecutableLinearTokenSequence(LinearTokenSequencePlan plan) { + for (int i = 0; i < plan.ops().size(); i++) { + LinearTokenSequencePlan.Op op = plan.ops().get(i); + if (op.kind() == LinearTokenSequencePlan.OpKind.ANCHOR) return false; + if (op.kind() == LinearTokenSequencePlan.OpKind.SKIP_ANY && i != plan.ops().size() - 1) { + return false; + } + if (op.kind() == LinearTokenSequencePlan.OpKind.OPTIONAL_SEQUENCE + && i + 1 < plan.ops().size() + && canOptionalPresentBranchStealFollowingInput(op, plan.ops().get(i + 1))) { + return false; + } + } + return true; + } + + private static boolean canOptionalPresentBranchStealFollowingInput( + LinearTokenSequencePlan.Op optional, LinearTokenSequencePlan.Op next) { + if (optional.children().isEmpty()) return false; + LinearTokenSequencePlan.Op first = optional.children().get(0); + if (first.kind() == LinearTokenSequencePlan.OpKind.LITERAL + && next.kind() == LinearTokenSequencePlan.OpKind.LITERAL) { + return !first.literal().isEmpty() + && !next.literal().isEmpty() + && first.literal().charAt(0) == next.literal().charAt(0); + } + return false; + } + /** * Check if the strategy would benefit from hybrid mode. Hybrid mode uses DFA for fast matching * and NFA for group extraction. @@ -361,7 +470,9 @@ private static byte[] generateBytecode( result.strategy == PatternAnalyzer.MatchingStrategy.OPTIMIZED_NFA || result.strategy == PatternAnalyzer.MatchingStrategy.OPTIMIZED_NFA_WITH_BACKREFS || result.strategy == PatternAnalyzer.MatchingStrategy.OPTIMIZED_NFA_WITH_LOOKAROUND - || result.strategy == PatternAnalyzer.MatchingStrategy.HYBRID_DFA_LOOKAHEAD; + || result.strategy == PatternAnalyzer.MatchingStrategy.HYBRID_DFA_LOOKAHEAD + || result.strategy == PatternAnalyzer.MatchingStrategy.SPECIALIZED_MULTIPLE_LOOKAHEADS + || result.strategy == PatternAnalyzer.MatchingStrategy.SPECIALIZED_LITERAL_LOOKAHEADS; boolean needsRecursiveDescent = result.strategy == PatternAnalyzer.MatchingStrategy.RECURSIVE_DESCENT; generateConstructor(cw, pattern, className, needsNFAState, needsRecursiveDescent, nfa, ast); @@ -536,6 +647,7 @@ private static byte[] generateBytecode( switchGen.generateFindFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); switchGen.generateMatchesAtStartMethod(cw); // Required by findFrom switchGen.generateMatchMethod(cw, "com/datadoghq/reggie/runtime/" + className); + switchGen.generateMatchIntoMethod(cw, "com/datadoghq/reggie/runtime/" + className); switchGen.generateMatchesBoundedMethod(cw, "com/datadoghq/reggie/runtime/" + className); switchGen.generateMatchBoundedMethod(cw, "com/datadoghq/reggie/runtime/" + className); switchGen.generateFindMatchMethod(cw, "com/datadoghq/reggie/runtime/" + className); @@ -543,6 +655,21 @@ private static byte[] generateBytecode( switchGen.generateFindBoundsFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); break; + case DFA_TABLE: + DFATableBytecodeGenerator tableGen = new DFATableBytecodeGenerator(result.dfa); + tableGen.generateStaticData(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateMatchesMethod(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateFindMethod(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateFindFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateMatchMethod(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateMatchIntoMethod(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateMatchesBoundedMethod(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateMatchBoundedMethod(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateFindMatchMethod(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateFindMatchFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); + tableGen.generateFindBoundsFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); + break; + case FIXED_REPETITION_BACKREF: FixedRepetitionBackrefInfo fixedRepBackrefInfo = (FixedRepetitionBackrefInfo) result.patternInfo; @@ -632,6 +759,7 @@ private static byte[] generateBytecode( literalGen.generateFindMethod(cw, "com/datadoghq/reggie/runtime/" + className); literalGen.generateFindFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); literalGen.generateMatchMethod(cw, "com/datadoghq/reggie/runtime/" + className); + literalGen.generateMatchIntoMethod(cw, "com/datadoghq/reggie/runtime/" + className); literalGen.generateMatchBoundedMethod(cw, "com/datadoghq/reggie/runtime/" + className); literalGen.generateFindLongestMatchEndMethod( cw, "com/datadoghq/reggie/runtime/" + className); @@ -658,6 +786,7 @@ private static byte[] generateBytecode( hybridGen.generateFindMethod(cw, "com/datadoghq/reggie/runtime/" + className); hybridGen.generateFindFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); hybridGen.generateMatchMethod(cw, "com/datadoghq/reggie/runtime/" + className); + hybridGen.generateMatchIntoMethod(cw, "com/datadoghq/reggie/runtime/" + className); hybridGen.generateMatchBoundedMethod(cw, "com/datadoghq/reggie/runtime/" + className); hybridGen.generateFindLongestMatchEndMethod( cw, "com/datadoghq/reggie/runtime/" + className); @@ -682,6 +811,7 @@ private static byte[] generateBytecode( nfaGen.generateFindMethod(cw, "com/datadoghq/reggie/runtime/" + className); nfaGen.generateFindFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); nfaGen.generateMatchMethod(cw, "com/datadoghq/reggie/runtime/" + className); + nfaGen.generateMatchIntoMethod(cw, "com/datadoghq/reggie/runtime/" + className); nfaGen.generateMatchBoundedMethod( cw, "com/datadoghq/reggie/runtime/" + className); // Phase 1.1 optimization nfaGen.generateMatchesBoundedMethod(cw, "com/datadoghq/reggie/runtime/" + className); @@ -710,6 +840,7 @@ private static byte[] generateBytecode( nfaGen.generateFindMethod(cw, "com/datadoghq/reggie/runtime/" + className); nfaGen.generateFindFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); nfaGen.generateMatchMethod(cw, "com/datadoghq/reggie/runtime/" + className); + nfaGen.generateMatchIntoMethod(cw, "com/datadoghq/reggie/runtime/" + className); nfaGen.generateMatchBoundedMethod( cw, "com/datadoghq/reggie/runtime/" + className); // Phase 1.1 optimization nfaGen.generateFindLongestMatchEndMethod( @@ -735,6 +866,7 @@ private static byte[] generateBytecode( // Now generate public API methods (these call the parser methods) recursiveGen.generateMatchesMethod(cw, "com/datadoghq/reggie/runtime/" + className); + recursiveGen.generateMatchIntoMethod(cw, "com/datadoghq/reggie/runtime/" + className); recursiveGen.generateFindMethod(cw, "com/datadoghq/reggie/runtime/" + className); recursiveGen.generateFindFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); recursiveGen.generateFindBoundsFromMethod(cw, "com/datadoghq/reggie/runtime/" + className); @@ -833,8 +965,16 @@ private static void generateConstructor( false); } - // Recursive descent doesn't need special constructor initialization - // Parser state is managed in the parser methods themselves + if (needsRecursiveDescent) { + mv.visitVarInsn(ALOAD, 0); // this + mv.visitLdcInsn(nfa != null ? nfa.getGroupCount() : countGroups(pattern)); // groupCount + mv.visitMethodInsn( + INVOKEVIRTUAL, + "com/datadoghq/reggie/runtime/ReggieMatcher", + "initRecursiveState", + "(I)V", + false); + } mv.visitInsn(RETURN); mv.visitMaxs(0, 0); diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java new file mode 100644 index 0000000..9359d92 --- /dev/null +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java @@ -0,0 +1,116 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +import com.datadoghq.reggie.Reggie; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import org.junit.jupiter.api.Test; + +/** Regression coverage for fuzz $ anchor findings. */ +class AnchorDiagTest { + @Test + void diagNoClearCacheEver() { + // Verify that $ patterns work correctly even when compiled AFTER many other patterns, + // without any clearCache() in between. Bypasses check() to avoid its internal clearCache(). + RuntimeCompiler.clearCache(); // single clear at start only + for (char c = 'a'; c <= 'z'; c++) { + Reggie.compile(String.valueOf(c)); + Reggie.compile("[" + c + "]"); + Reggie.compile(c + "."); + Reggie.compile("." + c); + if (c != 'a') Reggie.compile(c + "$"); + if (c != 'z') Reggie.compile("." + c + "$"); + } + for (char d = '0'; d <= '9'; d++) { + Reggie.compile(String.valueOf(d)); + Reggie.compile(d + "$"); + } + // Now test the $ patterns without any additional clearCache + String[][] cases = { + {"c$", "c"}, {".$", "b"}, {"[b]${1}", "b"}, {"$", "c"}, {"$", "_"}, + {"a?$", ""}, {".{0}$", ""}, {"${1}", ""}, {"Z{1}|$", ""}, {"0|${1}", ""} + }; + for (String[] tc : cases) { + assertFindEquivalent(tc[0], tc[1], false); + } + } + + @Test + void diag() { + check("c$", "c"); + check(".$", "b"); + check("[b]${1}", "b"); + check("$", "c"); + check("$", "_"); + check("a?$", ""); + check(".{0}$", ""); + check("$c?", ""); + check("${1}", ""); + check("${3}", ""); + check("Z{1}|$", ""); + check("0|${1}", ""); + check("[c]*(?:[_]?-)$|]", "-"); + check("^{1}|.", "b"); + } + + static void check(String pat, String inp) { + assertFindEquivalent(pat, inp, true); + } + + private static void assertFindEquivalent(String pat, String inp, boolean clearCache) { + if (clearCache) { + RuntimeCompiler.clearCache(); + } + Pattern jdk = Pattern.compile(pat); + Matcher jm = jdk.matcher(inp); + boolean jdkFound = jm.find(); + + ReggieMatcher rm = Reggie.compile(pat); + MatchResult r = rm.findMatch(inp); + boolean reggieFound = r != null; + + assertEquals(jdkFound, reggieFound, failureMessage(pat, inp, jm, jdkFound, r, rm)); + if (jdkFound) { + assertEquals(jm.start(), r.start(), failureMessage(pat, inp, jm, jdkFound, r, rm)); + assertEquals(jm.end(), r.end(), failureMessage(pat, inp, jm, jdkFound, r, rm)); + } + } + + private static String failureMessage( + String pat, String inp, Matcher jm, boolean jdkFound, MatchResult r, ReggieMatcher rm) { + return "pat=" + + pat + + " input=\"" + + inp + + "\" jdk=" + + span(jm, jdkFound) + + " reggie=" + + span(r) + + " strategy=" + + rm.getClass().getSimpleName(); + } + + private static String span(Matcher matcher, boolean found) { + return found ? "[" + matcher.start() + "," + matcher.end() + ")" : "null"; + } + + private static String span(MatchResult result) { + return result != null ? "[" + result.start() + "," + result.end() + ")" : "null"; + } +} diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorRegressionTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorRegressionTest.java new file mode 100644 index 0000000..f929653 --- /dev/null +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorRegressionTest.java @@ -0,0 +1,293 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +import com.datadoghq.reggie.Reggie; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +/** + * Regression tests for the anchor-aware DFA construction fix. Each case cross-checks Reggie's + * {@code find}/{@code findMatch}/{@code matches} behavior against {@link java.util.regex.Pattern} + * to lock in the corrected semantics. + * + *

Before the fix the DFA-based code path silently dropped {@code $}/{@code \\Z}/{@code \\z} + * anchors that were not at the rightmost position of their concat, treated all alternation branches + * as sharing a single global end-anchor check, and never validated the start state's end-anchor + * condition. See the design note in {@code SubsetConstructor.precomputeAnchoredClosures} for the + * rules now applied. + */ +public class AnchorRegressionTest { + + @BeforeEach + void clearCache() { + RuntimeCompiler.clearCache(); + } + + // --- $ anchor placement --------------------------------------------------------------- + + @Test + void dollarFollowedByConsumer_isUnsatisfiable() { + expectFindNone("$x", ""); + expectFindNone("$x", "x"); + expectFindNone("$x", "ax"); + expectFindNone("$x", "xa"); + expectFindNone("$x", "a$x"); + } + + @Test + void bareDollar_matchesAtEndOfInput() { + expectFindMatch("$", "", 0, 0); + expectFindMatch("$", "x", 1, 1); + expectFindMatch("$", "ax", 2, 2); + expectFindMatch("$", "xa", 2, 2); + expectFindMatch("$", "abc", 3, 3); + } + + @Test + void dollarAtEndOfConcat_keepsWorking() { + expectFindMatch("x$", "x", 0, 1); + expectFindMatch("x$", "ax", 1, 2); + expectFindNone("x$", "xa"); + } + + @Test + void dollarAtEndOfBranch_keepsWorking() { + expectFindNone("a$|b", "x"); + expectFindMatch("a$|b", "xa", 1, 2); + expectFindMatch("a$|b", "b", 0, 1); + expectFindMatch("a$|b", "ab", 1, 2); + } + + @Test + void dollarAtHeadOfBranch_isDeadButOtherBranchSurvives() { + expectFindNone("$a|b", "x"); + expectFindNone("$a|b", "ax"); + expectFindNone("$a|b", "xa"); + expectFindMatch("$a|b", "b", 0, 1); + expectFindMatch("$a|b", "ab", 1, 2); + } + + // --- ^ anchor placement --------------------------------------------------------------- + + @Test + void startAnchor_matchesOnlyAtPositionZero() { + expectFindMatch("^[0-9]", "1abc", 0, 1); + expectFindNone("^[0-9]", "abc1"); + expectFindMatch("^[0-9]", "1", 0, 1); + expectFindNone("^[0-9]", "abc"); + } + + @Test + void stringStartAnchorInOneAlternative_doesNotSkipOtherBranchPositions() { + // Cat F from fuzz triage: the find()-loop optimization used to read + // `hasStringStartAnchor` (any \A anywhere) instead of `requiresStartAnchor` (all paths + // need \A). For patterns like `]\A|b` where only one branch has \A, the optimization was + // returning -1 at non-zero positions, masking the always-position-valid branch. + expectFindMatch("]\\A|b", "cb", 1, 2); + expectFindMatch("]\\A|b", "b", 0, 1); + expectFindMatch("_{1}(\\A)|_", "-_", 1, 2); + } + + @Test + void mixedStartAndEndAnchorAlternatives() { + expectFindMatch("^a|b$", "abc", 0, 1); + expectFindMatch("^a|b$", "ab", 0, 1); + expectFindNone("^a|b$", "ba"); + expectFindNone("^a|b$", "ca"); + expectFindMatch("^a|b$", "cb", 1, 2); + expectFindMatch("^a|b$", "ac", 0, 1); + } + + // --- The original bug report ----------------------------------------------------------- + + @Test + void originalUserPattern_dollarCharClassOrStartDigit() { + String regex = "$[^a-zA-Z0-9]|^[0-9]"; + expectFindNone(regex, "abc"); + expectFindMatch(regex, "1abc", 0, 1); + expectFindNone(regex, "abc!"); + expectFindNone(regex, "abc.def"); + expectFindMatch(regex, "1", 0, 1); + expectFindNone(regex, "!abc"); + expectFindNone(regex, "."); + expectFindNone(regex, ""); + } + + @Test + void trailingZeroes_doesNotMatchInMiddle() { + String regex = "\\.?0+$"; + expectFindMatch(regex, "10.00", 2, 5); + expectFindMatch(regex, "10.0", 2, 4); + expectFindMatch(regex, "100", 1, 3); + expectFindNone(regex, "abc.00def"); + expectFindNone(regex, "1.5"); + } + + // --- $ before trailing newline (Java: $ matches at end OR before final '\n') -------- + + @Test + void dollarMatchesBeforeTrailingNewline() { + // In Java, $ (non-multiline) is semantically identical to \Z: matches at end of + // string OR immediately before a single trailing '\n'. + expectFindMatch("c$", "c\n", 0, 1); + expectFindMatch(".$", "b\n", 0, 1); + expectFindMatch("a?$", "\n", 0, 0); + expectFindMatch("$", "c\n", 1, 1); + expectFindMatch("$", "\n", 0, 0); + expectFindMatch("Z{1}|$", "\n", 0, 0); + expectFindMatch("[c]*(?:[_]?-)$", "-\n", 0, 1); + } + + // --- \A / \Z / \z --------------------------------------------------------------------- + + @Test + void stringStartAndEndAnchors() { + expectFindMatch("\\A", "", 0, 0); + expectFindMatch("\\A", "ax", 0, 0); + expectFindMatch("\\Z", "", 0, 0); + expectFindMatch("\\Z", "ax", 2, 2); + expectFindMatch("\\z", "", 0, 0); + expectFindMatch("\\z", "ax", 2, 2); + } + + // --- Cat E/F residual: anchor-condition dilution in DFA subset construction ----------- + + @Test + void anchorConditionDilution_zStringEndBeforeConsumer_alternatedWithUnanchored() { + // \Z.[a]{1}|_- on "_a": branch 1 has \Z anchor, branch 2 has none; + // mergeWeakest dropped the anchor, causing DFA to accept incorrectly. + expectFindNone("\\Z.[a]{1}|_-", "_a"); + } + + @Test + void anchorConditionDilution_disjointAcceptConditions_quantifiedGroup() { + // [ca]{2}(Z?^|\Z) on "cab": the accept states inside the group have disjoint + // conditions ({START_MULTILINE} vs {STRING_END}); intersection collapsed to + // unconditional, accepting after the two-char prefix. + expectFindNone("[ca]{2}(Z?^|\\Z)", "cab"); + } + + @Test + void anchorConditionDilution_zStringEndStar_alternatedWithLiteralSuffix() { + // \Z[1]*|1] on "1": JDK finds empty match at [1,1) (only the \Z branch matches, + // zero-width at end), Reggie was greedily matching the `1` via the diluted path. + expectFindMatch("\\Z[1]*|1]", "1", 1, 1); + } + + @Test + void anchorConditionDilution_zeroMinQuantifierWithStartAnchor_alternation() { + // (1{0,}^|]{2}) on "1": JDK finds zero-width match at [0,0) (the ^ branch fires + // at pos 0), Reggie was consuming the `1` via the diluted path. + expectFindMatch("(1{0,}^|]{2})", "1", 0, 0); + } + + // --- matches() vs \Z before trailing newline ------------------------------------------ + + @Test + void stringEndMatchesMode_doesNotConsumeTrailingNewline() { + // matches() requires the FULL input to be consumed. \Z can accept before the final '\n' + // for find(), but in matches() mode the trailing '\n' is NOT consumed by \Z, so the + // full input "abc\n" is not covered and matches() must return false. + expectMatchesFalse("abc\\Z", "abc\n"); + expectMatchesFalse(".*abc\\Z", "abc\n"); + expectMatchesFalse("[^1]\\Z|-", "a\n"); + // Absolute end anchor \z behaves the same (it never admits the trailing '\n') + expectMatchesFalse("abc\\z", "abc\n"); + // Without trailing '\n', \Z and \z both accept normally + expectMatchesTrue("abc\\Z", "abc"); + expectMatchesTrue("[^1]\\Z|-", "a"); + expectMatchesTrue("[^1]\\Z|-", "-"); + // When the pattern CONSUMES the '\n' (e.g., [^1] matches '\n'), \Z at absolute end → true + expectMatchesTrue("[^1]\\Z", "\n"); + expectMatchesTrue("[^1]\\Z|-", "\n"); + } + + // --- Helpers -------------------------------------------------------------------------- + + private static void expectMatchesTrue(String regex, String input) { + Pattern jdk = Pattern.compile(regex); + if (!jdk.matcher(input).matches()) { + throw new IllegalArgumentException( + "Test premise wrong: JDK matches('" + input + "') for /" + regex + "/ returned false"); + } + ReggieMatcher m = Reggie.compile(regex); + org.junit.jupiter.api.Assertions.assertTrue( + m.matches(input), + () -> "Reggie matches('" + input + "') for /" + regex + "/ should be true"); + } + + private static void expectMatchesFalse(String regex, String input) { + Pattern jdk = Pattern.compile(regex); + if (jdk.matcher(input).matches()) { + throw new IllegalArgumentException( + "Test premise wrong: JDK matches('" + input + "') for /" + regex + "/ returned true"); + } + ReggieMatcher m = Reggie.compile(regex); + org.junit.jupiter.api.Assertions.assertFalse( + m.matches(input), + () -> "Reggie matches('" + input + "') for /" + regex + "/ should be false"); + } + + private static void expectFindMatch(String regex, String input, int start, int end) { + Pattern jdk = Pattern.compile(regex); + Matcher jm = jdk.matcher(input); + boolean jdkMatched = jm.find(); + if (!(jdkMatched && jm.start() == start && jm.end() == end)) { + throw new IllegalArgumentException( + "Test premise wrong: JDK did not match pattern '" + + regex + + "' on '" + + input + + "' as [" + + start + + "," + + end + + ")"); + } + ReggieMatcher m = Reggie.compile(regex); + MatchResult mr = m.findMatch(input); + assertEquals( + "[" + start + "," + end + ")", + mr == null ? "none" : "[" + mr.start() + "," + mr.end() + ")", + () -> + "Reggie find('" + + input + + "') for /" + + regex + + "/ should be [" + + start + + "," + + end + + ")"); + } + + private static void expectFindNone(String regex, String input) { + Pattern jdk = Pattern.compile(regex); + if (jdk.matcher(input).find()) { + throw new IllegalArgumentException( + "Test premise wrong: JDK matched pattern '" + regex + "' on '" + input + "'"); + } + ReggieMatcher m = Reggie.compile(regex); + MatchResult mr = m.findMatch(input); + assertEquals( + null, mr, () -> "Reggie find('" + input + "') for /" + regex + "/ should not match"); + } +} diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/BoundedQuantifierRegressionTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/BoundedQuantifierRegressionTest.java new file mode 100644 index 0000000..6eb6aec --- /dev/null +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/BoundedQuantifierRegressionTest.java @@ -0,0 +1,144 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +import com.datadoghq.reggie.Reggie; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +/** + * Regression tests for two bounded-quantifier bugs in the find()/findMatch() path: + * + *

    + *
  1. {@code STATELESS_LOOP}: {@code generateFindMatchFromMethod} ignored the quantifier's upper + * bound when greedy-extending the match end, so {@code [0-9]{5}} matched all digits and + * {@code [0-9]{5,7}} matched up to the input length. The fix caps the matchEnd scan at {@code + * matchStart + maxReps}. + *
  2. {@code DFA_SWITCH}: {@code MultiRangeOptimization} silently fell back to scanning only the + * first range when the multi-range layout was not the hand-written {@code [a-zA-Z]} or {@code + * [a-zA-Z0-9]} shape. For a pattern starting with {@code [-_]?[0-9]} the SWAR scan searched + * only for {@code '-'} and missed inputs that started with a digit or underscore — the + * symptom the user reported as "{@code [-_]?[0-9]{5,}} truncates digits". The fix gates + * {@code MultiRangeOptimization} to the two supported shapes. + *
+ */ +public class BoundedQuantifierRegressionTest { + + @BeforeEach + void clearCache() { + RuntimeCompiler.clearCache(); + } + + // --- STATELESS_LOOP upper-bound cap -------------------------------------------------- + + @Test + void exactlyN_doesNotOverConsume() { + expectFindMatch("[0-9]{5}", "1234567890", 0, 5); + expectFindMatch("[0-9]{5}", "abc1234567xyz", 3, 8); + expectFindMatch("a{5}", "aaaaaaa", 0, 5); + } + + @Test + void boundedRange_capsAtUpperBound() { + expectFindMatch("[0-9]{5,7}", "1234567890", 0, 7); + expectFindMatch("[0-9]{5,7}", "12345", 0, 5); + expectFindMatch("[0-9]{5,7}", "123456", 0, 6); + expectFindMatch("a{5,7}", "aaaaaaa", 0, 7); + } + + @Test + void wideBoundedRange_capsAtUpperBound() { + expectFindMatch("[0-9]{5,99}", "1".repeat(150), 0, 99); + expectFindMatch("[0-9]{5,99}", "1".repeat(50), 0, 50); + } + + @Test + void unboundedRange_unchanged() { + expectFindMatch("[0-9]{5,}", "1".repeat(150), 0, 150); + expectFindMatch("[0-9]{5,}", "12345", 0, 5); + } + + // --- DFA_SWITCH multi-range first-char filter ---------------------------------------- + + @Test + void multiRangePrefixed_findsAllStartingChars() { + String regex = "[-_]?[0-9]{5,99}"; + expectFindMatch(regex, "12345", 0, 5); + expectFindMatch(regex, "lib-1234567890.so", 3, 14); + expectFindMatch(regex, "lib_1234567890.so", 3, 14); + expectFindMatch(regex, "1234567890", 0, 10); + } + + // --- Counted quantifier with min == 0 ------------------------------------------------- + + @Test + void countedQuantifier_minZero_allowsZeroReps() { + // Cat C from fuzz triage: {n,m} with n=0 was missing the "skip the whole quantifier" + // ε-bypass in ThompsonBuilder.buildCountedQuantifier, so `[ab]c{0,3}` against "a" failed + // and so did anything of the form `prefix X{0,N}` against a prefix-only input. + expectFindMatch("[^c]c{0,3}", "b", 0, 1); + expectFindMatch("[ab]c{0,3}", "a", 0, 1); + expectFindMatch("[ab]c{0,3}", "b", 0, 1); + expectFindMatch("[ab]c{0,3}", "ac", 0, 2); + expectFindMatch("[ab]c{0,3}", "accc", 0, 4); + expectFindMatch("c{0,3}", "", 0, 0); + expectFindMatch("c{0,3}", "c", 0, 1); + expectFindMatch("c{0,3}", "cccc", 0, 3); + expectFindMatch("a[^c]c{0,3}", "ab", 0, 2); + expectFindMatch("a[^c]c{0,3}", "ab1", 0, 2); + } + + @Test + void multiRangePrefixed_atVariousBounds() { + // {5,1}-bound below the DFA_UNROLLED→DFA_SWITCH threshold still goes through DFA_UNROLLED + // and worked pre-fix; the fix here is about not regressing it when SWAR is enabled. + expectFindMatch("[-_]?[0-9]{5,10}", "1234567890", 0, 10); + // Bounds that push the DFA past ~20 states route to DFA_SWITCH. + expectFindMatch("[-_]?[0-9]{1,99}", "12345", 0, 5); + expectFindMatch("[-_]?[0-9]{2,99}", "12345", 0, 5); + } + + // --- Helpers -------------------------------------------------------------------------- + + private static void expectFindMatch(String regex, String input, int start, int end) { + Pattern jdk = Pattern.compile(regex); + Matcher jm = jdk.matcher(input); + if (!(jm.find() && jm.start() == start && jm.end() == end)) { + throw new IllegalArgumentException( + "Test premise wrong: JDK did not match /" + + regex + + "/ on '" + + input + + "' as [" + + start + + "," + + end + + ")"); + } + ReggieMatcher m = Reggie.compile(regex); + MatchResult mr = m.findMatch(input); + String expected = "[" + start + "," + end + ")"; + String actual = mr == null ? "none" : "[" + mr.start() + "," + mr.end() + ")"; + assertEquals( + expected, + actual, + () -> "Reggie find('" + input + "') for /" + regex + "/ should be " + expected); + } +} diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/CapturePolicyTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/CapturePolicyTest.java new file mode 100644 index 0000000..aa68ca3 --- /dev/null +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/CapturePolicyTest.java @@ -0,0 +1,62 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import static org.junit.jupiter.api.Assertions.*; + +import com.datadoghq.reggie.CapturePolicy; +import com.datadoghq.reggie.Reggie; +import com.datadoghq.reggie.ReggieOptions; +import org.junit.jupiter.api.Test; + +class CapturePolicyTest { + + @Test + void namedOnlyPreservesNamedGroupIndexesAndDropsInternalCaptures() { + ReggieMatcher matcher = + Reggie.compile( + "(?(a|b)+)-(?(c))", + ReggieOptions.builder().capturePolicy(CapturePolicy.NAMED_ONLY).build()); + + MatchResult result = matcher.match("abba-c"); + assertNotNull(result); + assertEquals(4, result.groupCount()); + assertEquals("abba", result.group(1)); + assertNull(result.group(2)); + assertEquals("c", result.group(3)); + assertNull(result.group(4)); + } + + @Test + void namedOnlyMatchIntoUsesOriginalNamedGroupIndexes() { + ReggieMatcher matcher = + Reggie.compile( + "(?(a|b)+)-(?(c))", + ReggieOptions.builder().capturePolicy(CapturePolicy.NAMED_ONLY).build()); + + int[] starts = new int[5]; + int[] ends = new int[5]; + assertTrue(matcher.matchInto("abba-c", starts, ends)); + assertEquals(0, starts[1]); + assertEquals(4, ends[1]); + assertEquals(-1, starts[2]); + assertEquals(-1, ends[2]); + assertEquals(5, starts[3]); + assertEquals(6, ends[3]); + assertEquals(-1, starts[4]); + assertEquals(-1, ends[4]); + } +} diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/DFAStateBudgetFallbackTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/DFAStateBudgetFallbackTest.java new file mode 100644 index 0000000..0ad6ed6 --- /dev/null +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/DFAStateBudgetFallbackTest.java @@ -0,0 +1,57 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +import com.datadoghq.reggie.Reggie; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +class DFAStateBudgetFallbackTest { + + @BeforeEach + void clearCache() { + RuntimeCompiler.clearCache(); + } + + @Test + void largePureDFAUsesTableBackend() { + ReggieMatcher matcher = Reggie.compile("(?:[a-z][0-9]){150}"); + String input = "a1".repeat(150); + + assertDoesNotThrow(() -> matcher.getClass().getDeclaredField("DFA_TRANSITIONS")); + assertTrue(matcher.matches(input)); + assertFalse(matcher.matches(input + "x")); + assertTrue(matcher.find("xx" + input + "yy")); + + int[] bounds = new int[2]; + assertTrue(matcher.findBoundsFrom("xx" + input + "yy", 0, bounds)); + assertTrue(bounds[0] == 2 && bounds[1] == 302); + } + + @Test + void largeAlternationQuantifierDoesNotFailCompilation() { + ReggieMatcher matcher = Reggie.compile("(?:a|b|c|d|e|f){100}"); + + assertTrue(matcher.matches("a".repeat(100))); + assertTrue(matcher.matches("abcdef".repeat(16) + "abcd")); + assertFalse(matcher.matches("a".repeat(99))); + assertFalse(matcher.matches("a".repeat(101))); + } +} diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/FallbackVerificationTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/FallbackVerificationTest.java index ec9da76..661a889 100644 --- a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/FallbackVerificationTest.java +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/FallbackVerificationTest.java @@ -65,12 +65,12 @@ void alternationInsideLookbehind() { assertFalse(m.find("xc")); } - // Bug 5: lookbehind + lookahead sandwich + // Bug 5 fixed: lookbehind + lookahead sandwich no longer needs fallback @Test void lookbehindLookaheadSandwich() { ReggieMatcher m = Reggie.compile("(?<=\\[)[^\\]]+(?=\\])"); - assertTrue(m instanceof JavaRegexFallbackMatcher); - assertTrue(m.find("[value]")); // was incorrectly false before fallback + assertFalse(m instanceof JavaRegexFallbackMatcher); + assertTrue(m.find("[value]")); assertFalse(m.find("value")); } diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LinearTokenSequenceAccessLogTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LinearTokenSequenceAccessLogTest.java new file mode 100644 index 0000000..c49dffe --- /dev/null +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LinearTokenSequenceAccessLogTest.java @@ -0,0 +1,304 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertNotNull; +import static org.junit.jupiter.api.Assertions.assertTrue; + +import com.datadoghq.reggie.CapturePolicy; +import com.datadoghq.reggie.Reggie; +import com.datadoghq.reggie.ReggieOptions; +import com.datadoghq.reggie.codegen.parsing.RegexParser; +import java.io.IOException; +import java.io.InputStream; +import java.lang.reflect.Field; +import java.nio.charset.StandardCharsets; +import java.util.Arrays; +import java.util.Map; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import org.junit.jupiter.api.Test; + +class LinearTokenSequenceAccessLogTest { + private static final ReggieOptions NAMED_ONLY = + ReggieOptions.builder().capturePolicy(CapturePolicy.NAMED_ONLY).build(); + + private static final String COMBINED_ACCESS_LOG_PATTERN = + "(?s)(?[0-9A-Fa-f:.]+) (?\\S+) (?\\S+) " + + "\\[(?[^\\]]+)\\]\\s+\"(?\\b\\w+\\b) (?\\S+) HTTP/(?\\d+\\.\\d+)\" " + + "(?[+-]?\\d+) (?[+-]?\\d+) " + + "\"(?\\S+)\" \"(?[^\\\"]*)\" \"(?[^\\\"]*)\" \"(?[^\\\"]*)\" " + + "(?[+-]?\\d+(?:\\.\\d+)?) (?[+-]?\\d+(?:\\.\\d+)?).* " + + "\\[(?\\b\\w+\\b)\\] .*"; + + @Test + void matchesCombinedAccessLogWithDelimiterAwareCaptures() { + ReggieMatcher matcher = Reggie.compile(COMBINED_ACCESS_LOG_PATTERN, NAMED_ONLY); + String input = + "10.202.82.195 - - [15/Mar/2019:19:45:35 -0700] \"POST /config?x=y HTTP/1.1\" " + + "200 17888 \"https://example.com/index.html\" \"Mozilla/5.0 Test\" \"-\" " + + "\"tracking-id\" 0.024 0.024 . [nginx_access] [not_the_logger]"; + + int[] starts = new int[17]; + int[] ends = new int[17]; + assertTrue(matcher.matchInto(input, starts, ends)); + + assertGroup(input, starts, ends, 1, "10.202.82.195"); + assertGroup(input, starts, ends, 4, "15/Mar/2019:19:45:35 -0700"); + assertGroup(input, starts, ends, 5, "POST"); + assertGroup(input, starts, ends, 6, "/config?x=y"); + assertGroup(input, starts, ends, 7, "1.1"); + assertGroup(input, starts, ends, 8, "200"); + assertGroup(input, starts, ends, 9, "17888"); + assertGroup(input, starts, ends, 10, "https://example.com/index.html"); + assertGroup(input, starts, ends, 11, "Mozilla/5.0 Test"); + assertGroup(input, starts, ends, 12, "-"); + assertGroup(input, starts, ends, 13, "tracking-id"); + assertGroup(input, starts, ends, 14, "0.024"); + assertGroup(input, starts, ends, 15, "0.024"); + assertGroup(input, starts, ends, 16, "nginx_access"); + } + + @Test + void routesRealExpandedCommonAccessLogPatternThroughLinearTokenSequenceMatcher() + throws Exception { + ReggieMatcher matcher = Reggie.compile(testResource("logs-grok-pattern-1.regex"), NAMED_ONLY); + String input = + "10.202.82.195 - - [15/Mar/2019:19:45:35 -0700] \"POST /config?x=y HTTP/1.1\" " + + "200 17888"; + + MatchResult result = matcher.match(input); + + assertNotNull(result); + assertEquals("10.202.82.195", result.group("grok0")); + assertEquals("POST", result.group("grok4")); + assertEquals("/config?x=y", result.group("grok5")); + assertEquals("1.1", result.group("grok6")); + assertEquals("200", result.group("grok7")); + assertEquals("17888", result.group("grok8")); + assertDelegateType(matcher, LinearTokenSequenceMatcher.class); + } + + @Test + void routesRealExpandedCombinedAccessLogPatternThroughLinearTokenSequenceMatcher() + throws Exception { + ReggieMatcher matcher = Reggie.compile(testResource("logs-grok-pattern-2.regex"), NAMED_ONLY); + String input = + "10.202.82.195 - - [15/Mar/2019:19:45:35 -0700] \"POST /config?x=y HTTP/1.1\" " + + "200 17888 \"https://example.com/index.html\" \"Mozilla/5.0 Test\" \"-\" " + + "\"tracking-id\" 0.024 0.024 . [nginx_access] [not_the_logger]"; + + MatchResult result = matcher.match(input); + + assertNotNull(result); + assertEquals("10.202.82.195", result.group("grok0")); + assertEquals("POST", result.group("grok4")); + assertEquals("/config?x=y", result.group("grok5")); + assertEquals("1.1", result.group("grok6")); + assertEquals("https://example.com/index.html", result.group("grok9")); + assertEquals("Mozilla/5.0 Test", result.group("grok10")); + assertEquals("tracking-id", result.group("grok12")); + assertEquals("0.024", result.group("grok13")); + assertEquals("0.024", result.group("grok14")); + assertEquals("nginx_access", result.group("grok15")); + assertDelegateType(matcher, LinearTokenSequenceMatcher.class); + } + + @Test + void realExpandedCommonPatternHasJdkEquivalentNamedCaptureBoundaries() throws Exception { + String pattern = testResource("logs-grok-pattern-1.regex"); + assertNamedCaptureBoundariesEquivalent( + pattern, + commonMessage("10.202.82.195", "POST ", "/config?x=y", " HTTP/1.1", "17888"), + commonMessage("2001:db8::1", "", "/health", " HTTP/2.0", "-"), + commonMessage("api-1.example.com", "GET ", "/without-version", "", "42")); + } + + @Test + void realExpandedCombinedPatternHasJdkEquivalentNamedCaptureBoundaries() throws Exception { + String pattern = testResource("logs-grok-pattern-2.regex"); + assertNamedCaptureBoundariesEquivalent( + pattern, + combinedMessage( + "10.202.82.195", + "POST ", + "/config?x=y", + " HTTP/1.1", + "17888", + "https://example.com/index.html", + "Mozilla/5.0 Test", + "-", + "tracking-id", + "0.024", + "0.024", + "[decoy] . [nginx_access] [not-the-logger]"), + combinedMessage( + "2001:db8::1", + "", + "/health", + " HTTP/2.0", + "-", + "-", + "", + "", + "", + ".5", + "0.", + "[ignored] . [nginx_access] [not_the_logger]"), + combinedMessage( + "api-1.example.com", + "GET ", + "/without-version", + "", + "42", + "http://embedded.example/launcher.html", + "Agent/1.0", + "field1", + "field2", + "+12.5", + "-0.25", + "[not_the_logger] . [nginx_access] [host-with-dash]")); + } + + @Test + void leavesCallerArraysUnchangedOnNoMatch() { + ReggieMatcher matcher = Reggie.compile(COMBINED_ACCESS_LOG_PATTERN, NAMED_ONLY); + int[] starts = new int[17]; + int[] ends = new int[17]; + starts[1] = 123; + ends[1] = 456; + + assertFalse(matcher.matchInto("not an access log", starts, ends)); + + assertEquals(123, starts[1]); + assertEquals(456, ends[1]); + } + + private static void assertGroup(String input, int[] starts, int[] ends, int group, String value) { + assertEquals(value, input.substring(starts[group], ends[group])); + } + + private static void assertNamedCaptureBoundariesEquivalent(String pattern, String... inputs) + throws Exception { + RegexParser parser = new RegexParser(); + parser.parse(pattern); + Map nameToIndex = parser.getGroupNameMap(); + Pattern jdkPattern = Pattern.compile(pattern); + ReggieMatcher reggieMatcher = Reggie.compile(pattern, NAMED_ONLY); + assertDelegateTypeUnchecked(reggieMatcher, LinearTokenSequenceMatcher.class); + + for (String input : inputs) { + Matcher jdkMatcher = jdkPattern.matcher(input); + boolean jdkMatches = jdkMatcher.matches(); + int[] starts = new int[jdkMatcher.groupCount() + 1]; + int[] ends = new int[jdkMatcher.groupCount() + 1]; + Arrays.fill(starts, 777); + Arrays.fill(ends, 888); + boolean reggieMatches = reggieMatcher.matchInto(input, starts, ends); + + assertEquals(jdkMatches, reggieMatches, input); + if (!jdkMatches) { + assertTrue(Arrays.stream(starts).allMatch(value -> value == 777), input); + assertTrue(Arrays.stream(ends).allMatch(value -> value == 888), input); + continue; + } + + assertEquals(jdkMatcher.start(), starts[0], input); + assertEquals(jdkMatcher.end(), ends[0], input); + for (Map.Entry entry : nameToIndex.entrySet()) { + int group = entry.getValue(); + assertEquals(jdkMatcher.start(group), starts[group], entry.getKey() + " start: " + input); + assertEquals(jdkMatcher.end(group), ends[group], entry.getKey() + " end: " + input); + if (starts[group] >= 0) { + assertEquals( + jdkMatcher.group(group), + input.substring(starts[group], ends[group]), + entry.getKey() + " value: " + input); + } + } + } + } + + private static String commonMessage( + String client, + String methodWithSpace, + String target, + String versionWithPrefix, + String bytes) { + return client + + " - - [15/Mar/2019:19:45:35 -0700] \"" + + methodWithSpace + + target + + versionWithPrefix + + "\" 200 " + + bytes; + } + + private static String combinedMessage( + String client, + String methodWithSpace, + String target, + String versionWithPrefix, + String bytes, + String referer, + String userAgent, + String trackingId, + String upstreamTrackingId, + String duration, + String upstreamDuration, + String tail) { + return commonMessage(client, methodWithSpace, target, versionWithPrefix, bytes) + + " \"" + + referer + + "\" \"" + + userAgent + + "\" \"" + + trackingId + + "\" \"" + + upstreamTrackingId + + "\" " + + duration + + " " + + upstreamDuration + + " " + + tail; + } + + private static String testResource(String name) throws IOException { + String path = "/com/datadoghq/reggie/runtime/" + name; + try (InputStream stream = LinearTokenSequenceAccessLogTest.class.getResourceAsStream(path)) { + assertNotNull(stream, path); + return new String(stream.readAllBytes(), StandardCharsets.UTF_8).trim(); + } + } + + private static void assertDelegateType(ReggieMatcher matcher, Class expectedType) + throws Exception { + Field delegate = matcher.getClass().getDeclaredField("delegate"); + delegate.setAccessible(true); + assertEquals(expectedType, delegate.get(matcher).getClass()); + } + + private static void assertDelegateTypeUnchecked(ReggieMatcher matcher, Class expectedType) { + try { + assertDelegateType(matcher, expectedType); + } catch (Exception e) { + throw new AssertionError(e); + } + } +} diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LinearTokenSequenceMatcherTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LinearTokenSequenceMatcherTest.java new file mode 100644 index 0000000..3a2da54 --- /dev/null +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LinearTokenSequenceMatcherTest.java @@ -0,0 +1,179 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertNotEquals; +import static org.junit.jupiter.api.Assertions.assertThrows; +import static org.junit.jupiter.api.Assertions.assertTrue; + +import com.datadoghq.reggie.CapturePolicy; +import com.datadoghq.reggie.Reggie; +import com.datadoghq.reggie.ReggieOptions; +import com.datadoghq.reggie.codegen.analysis.LinearTokenSequencePlan; +import com.datadoghq.reggie.codegen.analysis.PatternCategorizer; +import com.datadoghq.reggie.codegen.ast.RegexNode; +import com.datadoghq.reggie.codegen.parsing.RegexParser; +import java.lang.reflect.Field; +import java.util.Arrays; +import java.util.Map; +import org.junit.jupiter.api.Test; + +class LinearTokenSequenceMatcherTest { + + @Test + void matchesLinearTokenSequenceAndExtractsCaptureBoundaries() throws Exception { + ReggieMatcher matcher = matcherFor("host=(?\\S+) status=(?[+-]?\\d+)"); + String input = "host=api.example.com status=200"; + + int[] starts = new int[3]; + int[] ends = new int[3]; + assertTrue(matcher.matchInto(input, starts, ends)); + + assertEquals("api.example.com", input.substring(starts[1], ends[1])); + assertEquals("200", input.substring(starts[2], ends[2])); + } + + @Test + void handlesQuotedDelimiterCaptures() throws Exception { + ReggieMatcher matcher = matcherFor("referer=\"(?[^\"]*)\""); + String input = "referer=\"https://example.com/index.html\""; + + MatchResult result = matcher.match(input); + + assertEquals("https://example.com/index.html", result.group("referer")); + } + + @Test + void preservesCallerArraysOnNoMatch() throws Exception { + ReggieMatcher matcher = matcherFor("host=(?\\S+) status=(?[+-]?\\d+)"); + int[] starts = new int[] {7, 7, 7}; + int[] ends = new int[] {9, 9, 9}; + + assertFalse(matcher.matchInto("host=api.example.com status=not-a-number", starts, ends)); + + assertTrue(Arrays.equals(new int[] {7, 7, 7}, starts)); + assertTrue(Arrays.equals(new int[] {9, 9, 9}, ends)); + } + + @Test + void validatesCallerArrays() throws Exception { + ReggieMatcher matcher = matcherFor("host=(?\\S+) status=(?[+-]?\\d+)"); + + assertThrows( + IndexOutOfBoundsException.class, + () -> matcher.matchInto("host=a status=1", new int[2], new int[3])); + } + + @Test + void runtimeCompilerRoutesNamedOnlyLinearTokenSequences() throws Exception { + ReggieMatcher matcher = + Reggie.compile( + "host=(?\\S+) status=(?[+-]?\\d+)", + ReggieOptions.builder().capturePolicy(CapturePolicy.NAMED_ONLY).build()); + + MatchResult result = matcher.match("host=api.example.com status=200"); + + assertEquals("api.example.com", result.group("host")); + assertEquals("200", result.group("status")); + assertDelegateType(matcher, LinearTokenSequenceMatcher.class); + } + + @Test + void namedOnlyProjectionRunsBeforeLinearTokenRouting() throws Exception { + ReggieMatcher matcher = Reggie.compile("(?\\S+) (\\d+)", NAMED_ONLY_OPTIONS); + int[] starts = new int[] {777, 777, 777}; + int[] ends = new int[] {888, 888, 888}; + + assertTrue(matcher.matchInto("api.example.com 200", starts, ends)); + + assertEquals("api.example.com", "api.example.com 200".substring(starts[1], ends[1])); + assertEquals(-1, starts[2]); + assertEquals(-1, ends[2]); + assertDelegateType(matcher, LinearTokenSequenceMatcher.class); + } + + @Test + void capturedDashAlternativeRecordsNamedGroupSpan() throws Exception { + ReggieMatcher matcher = Reggie.compile("(?(?:-|[+-]?\\d+))", NAMED_ONLY_OPTIONS); + + MatchResult dash = matcher.match("-"); + MatchResult digits = matcher.match("42"); + + assertEquals("-", dash.group("bytes")); + assertEquals("42", digits.group("bytes")); + assertDelegateType(matcher, LinearTokenSequenceMatcher.class); + } + + @Test + void ambiguousOptionalSequencesUseGeneralRegexRoute() { + ReggieMatcher matcher = Reggie.compile("(?:a|)a", NAMED_ONLY_OPTIONS); + + assertTrue(matcher.matches("a")); + assertNotEquals(LinearTokenSequenceMatcher.class, matcher.getClass()); + } + + @Test + void runtimeCompilerRoutesCombinedAccessLogTemplateWithNonGrokNames() throws Exception { + String pattern = + "(?s)(?(?:[0-9]{1,3}\\.){3}[0-9]{1,3}|[A-Za-z0-9.-]+) " + + "(?\\S+) (?\\S+) " + + "\\[(?[\\d]{2}/(?:[jJ][aA][nN]|[mM][aA][rR])/[\\d]{4,19}:[\\d]{2}:[\\d]{2}:[\\d]{2} [+-]\\d\\d:?\\d\\d)\\]\\s+" + + "\"(?>(?\\b\\w+\\b) |)(?\\S+)(?> HTTP\\/(?\\d+\\.\\d+)|)\" " + + "(?[+-]?\\d+) (?>(?[+-]?\\d+)|-) " + + "\"(?\\S+)\" \"(?[^\\\"]*)\" \"(?[^\\\"]*)\" \"(?[^\\\"]*)\" " + + "(?[+-]?(?>\\d+(?:\\.(?:\\d*)?)?|\\.\\d+)) " + + "(?[+-]?(?>\\d+(?:\\.(?:\\d*)?)?|\\.\\d+)).* " + + "\\[(?\\b\\w+\\b)\\] .*"; + ReggieMatcher matcher = Reggie.compile(pattern, NAMED_ONLY_OPTIONS); + String input = + "10.202.82.195 - - [15/Mar/2019:19:45:35 -0700] \"POST /config?x=y HTTP/1.1\" " + + "200 17888 \"https://example.com/index.html\" \"Mozilla/5.0 Test\" \"-\" " + + "\"tracking-id\" 0.024 0.024 . [nginx_access] [not_the_logger]"; + + MatchResult result = matcher.match(input); + + assertEquals("10.202.82.195", result.group("client")); + assertEquals("POST", result.group("method")); + assertEquals("/config?x=y", result.group("target")); + assertEquals("1.1", result.group("version")); + assertEquals("https://example.com/index.html", result.group("referer")); + assertEquals("Mozilla/5.0 Test", result.group("agent")); + assertEquals("nginx_access", result.group("logger")); + assertDelegateType(matcher, LinearTokenSequenceMatcher.class); + } + + private static final ReggieOptions NAMED_ONLY_OPTIONS = + ReggieOptions.builder().capturePolicy(CapturePolicy.NAMED_ONLY).build(); + + private static void assertDelegateType(ReggieMatcher matcher, Class expectedType) + throws Exception { + Field delegate = matcher.getClass().getDeclaredField("delegate"); + delegate.setAccessible(true); + assertEquals(expectedType, delegate.get(matcher).getClass()); + } + + private static ReggieMatcher matcherFor(String pattern) throws Exception { + RegexParser parser = new RegexParser(); + RegexNode ast = parser.parse(pattern); + Map names = parser.getGroupNameMap(); + LinearTokenSequencePlan plan = + LinearTokenSequencePlan.from(PatternCategorizer.categorize(ast)).orElseThrow(); + int groupCount = names.values().stream().mapToInt(Integer::intValue).max().orElse(0); + return new LinearTokenSequenceMatcher(pattern, plan, groupCount, names); + } +} diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LogsBackendParserCompatibilityTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LogsBackendParserCompatibilityTest.java new file mode 100644 index 0000000..c779cd6 --- /dev/null +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LogsBackendParserCompatibilityTest.java @@ -0,0 +1,74 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +import com.datadoghq.reggie.Reggie; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +class LogsBackendParserCompatibilityTest { + + @BeforeEach + void clearCache() { + RuntimeCompiler.clearCache(); + } + + @Test + void atomicGroupActsLikeNonCapturingGroupForLinearEngine() { + ReggieMatcher matcher = Reggie.compile("(?>\\d+)-(?>[a-z]+)"); + + assertTrue(matcher.matches("123-abc")); + assertFalse(matcher.matches("123-ABC")); + assertFalse(matcher.matches("abc-123")); + } + + @Test + void quotedLiteralEscapesDateSeparatorsFromPatternQuote() { + ReggieMatcher matcher = Reggie.compile("\\d{2}\\Q/\\E(?:Jan|Feb)\\Q/\\E\\d{4}\\Q:\\E\\d{2}"); + + assertTrue(matcher.matches("12/Jan/2026:09")); + assertTrue(matcher.matches("07/Feb/2026:23")); + assertFalse(matcher.matches("12XJan/2026:09")); + assertFalse(matcher.matches("12/Mar/2026:09")); + } + + @Test + void quotedLiteralTreatsRegexMetacharactersAsLiterals() { + ReggieMatcher matcher = Reggie.compile("prefix\\Q.*+?[]{}()|^$\\Esuffix"); + + assertTrue(matcher.matches("prefix.*+?[]{}()|^$suffix")); + assertFalse(matcher.matches("prefixAAAAAsuffix")); + } + + @Test + void quotedLiteralWorksInsideCharacterClass() { + ReggieMatcher matcher = Reggie.compile("[a\\Q-]\\Eb]+"); + + assertTrue(matcher.matches("a-]b")); + assertFalse(matcher.matches("abc")); + } + + @Test + void quotedLiteralCanRunToEndOfPattern() { + ReggieMatcher matcher = Reggie.compile("foo\\Q.bar"); + + assertTrue(matcher.matches("foo.bar")); + assertFalse(matcher.matches("fooXbar")); + } +} diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LookbehindVariantsTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LookbehindVariantsTest.java index e2a88ab..93f0af2 100644 --- a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LookbehindVariantsTest.java +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LookbehindVariantsTest.java @@ -190,6 +190,22 @@ void positiveLookbehindNegativeLookahead() { assertTrue(m.find("3abc")); } + @Test + void ipv4DigitBoundariesUseNativeDfaSwitchAssertions() { + ReggieMatcher m = + Reggie.compile( + "(? matcher.matchInto("ab", starts, ends)); + } + + @Test + void javaRegexFallbackMatcherPopulatesCallerArrays() { + ReggieMatcher matcher = new JavaRegexFallbackMatcher("(a)?b", "test"); + int[] starts = new int[2]; + int[] ends = new int[2]; + + assertTrue(matcher.matchInto("b", starts, ends)); + + assertArrayEquals(new int[] {0, -1}, starts); + assertArrayEquals(new int[] {1, -1}, ends); + } +} diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/OptionalGroupBackrefTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/OptionalGroupBackrefTest.java index e00bba9..91af23e 100644 --- a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/OptionalGroupBackrefTest.java +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/OptionalGroupBackrefTest.java @@ -24,20 +24,20 @@ * Tests for OPTIONAL_GROUP_BACKREF strategy. Patterns like (a)?\1 where backreference refers to * optional group. * - *

PCRE Semantics: - If optional group matched: backref must match captured content - If optional - * group didn't match: backref matches empty string + *

Java semantics (verified against JDK): - If optional group matched: backref must match + * captured content - If optional group did NOT participate (was skipped): backref FAILS to match */ class OptionalGroupBackrefTest { @Test void testSimpleOptionalBackref() { // (a)?\1 - optional 'a', then backref - // Matches: "" (group not matched, backref matches empty) + // Does NOT match: "" (group didn't participate — \1 fails per Java semantics) // Matches: "aa" (group matched 'a', backref matches 'a') // Does NOT match: "a" (group matched 'a', backref expects 'a' but only empty remains) ReggieMatcher m = Reggie.compile("(a)?\\1"); - assertTrue(m.matches(""), "(a)?\\1 should match '' (empty - group not matched, backref empty)"); + assertFalse(m.matches(""), "(a)?\\1 should NOT match '' — unmatched group makes \\1 fail"); assertTrue(m.matches("aa"), "(a)?\\1 should match 'aa' (group='a', backref='a')"); assertFalse(m.matches("a"), "Should not match 'a' (group='a' but no room for backref)"); assertFalse(m.matches("ab"), "Should not match 'ab'"); @@ -49,7 +49,7 @@ void testOptionalBackrefWithDifferentChar() { // (x)?\1 ReggieMatcher m = Reggie.compile("(x)?\\1"); - assertTrue(m.matches(""), "Should match empty string"); + assertFalse(m.matches(""), "Should NOT match empty string — unmatched group makes \\1 fail"); assertTrue(m.matches("xx"), "Should match 'xx'"); assertFalse(m.matches("x"), "Should not match 'x'"); assertFalse(m.matches("xy"), "Should not match 'xy'"); @@ -58,19 +58,18 @@ void testOptionalBackrefWithDifferentChar() { @Test void testMultipleOptionalGroups() { // (a)?(b)?\1\2 - two optional groups with backrefs - // Pattern matches sequentially: try (a)?, then (b)?, then \1, then \2 - // Matches: "" (neither matched, both backrefs match empty) - // Matches: "aa" (group1='a', group2=unmatched, \1='a', \2=empty) - // Matches: "bb" (group1=unmatched, group2='b', \1=empty, \2='b') + // Java semantics: when an optional group didn't participate, \N to it FAILS. // Matches: "abab" (group1='a', group2='b', \1='a', \2='b') - // Does NOT match: "aabb" (group1='a' at 0, group2 fails at 1 since 'a'!='b', - // \1='a' at 1, \2=empty, pos=2 != len=4) + // Does NOT match "" (neither group participated → \1 and \2 fail) + // Does NOT match "aa" (group2 unmatched → \2 fails) + // Does NOT match "bb" (group1 unmatched → \1 fails) + // Does NOT match "aabb" ReggieMatcher m = Reggie.compile("(a)?(b)?\\1\\2"); - assertTrue(m.matches(""), "Should match '' (neither group matched)"); - assertTrue(m.matches("aa"), "Should match 'aa' (first matched, second not)"); - assertTrue(m.matches("bb"), "Should match 'bb' (first not, second matched)"); - assertFalse(m.matches("aabb"), "Should NOT match 'aabb' (group2 fails at pos 1)"); + assertFalse(m.matches(""), "Should NOT match '' (unmatched groups make backrefs fail)"); + assertFalse(m.matches("aa"), "Should NOT match 'aa' (group2 unmatched → \\2 fails)"); + assertFalse(m.matches("bb"), "Should NOT match 'bb' (group1 unmatched → \\1 fails)"); + assertFalse(m.matches("aabb"), "Should NOT match 'aabb'"); assertTrue(m.matches("abab"), "Should match 'abab' (both matched sequentially)"); assertFalse(m.matches("a"), "Should not match 'a'"); assertFalse(m.matches("ab"), "Should not match 'ab'"); @@ -82,7 +81,7 @@ void testOptionalGroupWithAnchors() { // ^(a)?\1$ - anchored ReggieMatcher m = Reggie.compile("^(a)?\\1$"); - assertTrue(m.matches(""), "Should match ''"); + assertFalse(m.matches(""), "Should NOT match '' — unmatched group makes \\1 fail"); assertTrue(m.matches("aa"), "Should match 'aa'"); assertFalse(m.matches("a"), "Should not match 'a'"); assertFalse(m.matches("aaa"), "Should not match 'aaa'"); @@ -94,7 +93,7 @@ void testFind() { ReggieMatcher m = Reggie.compile("(a)?\\1"); assertTrue(m.find("xaay"), "Should find 'aa' in 'xaay'"); - assertTrue(m.find("xy"), "Should find '' in 'xy' (empty match)"); + assertFalse(m.find("xy"), "Should NOT find in 'xy' — no 'a' so group never matches"); assertTrue(m.find("aa"), "Should find in 'aa'"); } @@ -102,7 +101,7 @@ void testFind() { void testFindFrom() { ReggieMatcher m = Reggie.compile("(a)?\\1"); - // Note: This pattern can match empty string, so it will find at position 0 + // Note: match at pos=1 ("aa" in "xaay") int pos = m.findFrom("xaay", 0); assertTrue(pos >= 0, "Should find a match"); } @@ -110,17 +109,15 @@ void testFindFrom() { @Test void testBackrefOrderMatters() { // The pattern (a)?(b)?\2\1 has backrefs in different order - // Pattern matches: try (a)?, then (b)?, then \2, then \1 - // This test verifies correct group-to-backref mapping + // Java semantics: when an optional group didn't participate, \N to it FAILS. + // Matches: "abba" (group1='a', group2='b', \2='b', \1='a') + // Does NOT match "" (neither group participated → backrefs fail) + // Does NOT match "bb" (group1 unmatched → \1 fails) ReggieMatcher m = Reggie.compile("(a)?(b)?\\2\\1"); - assertTrue(m.matches(""), "Should match '' (neither group matched)"); - // "ba" does NOT match: (a)? fails at 'b', (b)? matches 'b' at pos 0, - // then \2 needs 'b' at pos 1 but input[1]='a' → FAIL + assertFalse(m.matches(""), "Should NOT match '' (unmatched groups make backrefs fail)"); assertFalse(m.matches("ba"), "Should NOT match 'ba' (backref \\2 needs 'b' but finds 'a')"); - // "bb" matches: (a)? fails, (b)? matches 'b', \2='b', \1=empty - assertTrue(m.matches("bb"), "Should match 'bb' (group2='b', group1 unmatched)"); - // "abba" matches: (a)? matches 'a', (b)? matches 'b', \2='b', \1='a' + assertFalse(m.matches("bb"), "Should NOT match 'bb' (group1 unmatched → \\1 fails)"); assertTrue(m.matches("abba"), "Should match 'abba'"); } } diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/PCREParityDebugTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/PCREParityDebugTest.java index 9534337..65a491c 100644 --- a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/PCREParityDebugTest.java +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/PCREParityDebugTest.java @@ -122,13 +122,12 @@ void testGreedyVsGroups() { @Test void testOptionalQuantifiedGroup() { - // Pattern (a+|b){0,1} on "AB" (case insensitive implied by test?) - // Actually this should fail on "AB" because no case insensitivity + // Pattern (a+|b){0,1} — finds first match in "ab" + // find() should match "a" via the a+ branch (first alternative, one occurrence) ReggieMatcher m = Reggie.compile("(a+|b){0,1}"); - MatchResult r = m.match("ab"); // Use lowercase + MatchResult r = m.findMatch("ab"); - // {0,1} means 0 or 1 match - could match empty or "a" - // Greedy should match "a" + // Greedy {0,1} tries 1 occurrence first; a+ matches "a" assertNotNull(r); System.out.println("Group 0: '" + r.group(0) + "'"); System.out.println("Group 1: '" + r.group(1) + "'"); diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/StringAnchorsTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/StringAnchorsTest.java index 36965ac..1f26c45 100644 --- a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/StringAnchorsTest.java +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/StringAnchorsTest.java @@ -67,9 +67,10 @@ public void testStringEnd() { assertTrue(m.matches("abc")); assertTrue(m.matches("xyzabc")); - assertTrue(m.matches("abc\n")); // \Z matches before final newline + assertFalse( + m.matches("abc\n")); // matches() requires consuming the full input; \n is not consumed assertFalse(m.matches("abcx")); - assertFalse(m.matches("abc\nx")); // \Z doesn't match before non-final newline + assertFalse(m.matches("abc\nx")); } @Test @@ -83,8 +84,9 @@ public void testStringEndVsAbsoluteEnd() { assertTrue(mZ.matches(input1)); assertTrue(mz.matches(input1)); - // \Z matches "abc\n", but \z doesn't - assertTrue(mZ.matches(input2), "\\Z should match before final newline"); + // Neither \Z nor \z matches "abc\n" in matches() mode: the trailing '\n' is not consumed + assertFalse( + mZ.matches(input2), "\\Z matches() requires full input; trailing \\n is not consumed"); assertFalse(mz.matches(input2), "\\z should not match before newline"); } diff --git a/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/UnsupportedPatternExceptionTest.java b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/UnsupportedPatternExceptionTest.java new file mode 100644 index 0000000..8315391 --- /dev/null +++ b/reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/UnsupportedPatternExceptionTest.java @@ -0,0 +1,36 @@ +/* + * Copyright 2026-Present Datadog, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.datadoghq.reggie.runtime; + +import static org.junit.jupiter.api.Assertions.assertThrows; + +import com.datadoghq.reggie.Reggie; +import com.datadoghq.reggie.UnsupportedPatternException; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +class UnsupportedPatternExceptionTest { + + @BeforeEach + void clearCache() { + RuntimeCompiler.clearCache(); + } + + @Test + void unsupportedConstructThrowsPublicExceptionType() { + assertThrows(UnsupportedPatternException.class, () -> Reggie.compile("(?C)")); + } +} diff --git a/reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-1.regex b/reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-1.regex new file mode 100644 index 0000000..f3c3db6 --- /dev/null +++ b/reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-1.regex @@ -0,0 +1 @@ +(?s)(?(?:(?:((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?|(?\S+) (?\S+) \[(?[\d]{2}\Q/\E(?:[jJ][aA][nN]|[fF][eE][bB]|[mM][aA][rR]|[aA][pP][rR]|[mM][aA][yY]|[jJ][uU][nN]|[jJ][uU][lL]|[aA][uU][gG]|[sS][eE][pP]|[oO][cC][tT]|[nN][oO][vV]|[dD][eE][cC]|)\Q/\E[\d]{4,19}\Q:\E[\d]{2}\Q:\E[\d]{2}\Q:\E[\d]{2}\Q \E(?:Z|[+-]\d\d:?\d\d))\]\s+"(?>(?\b\w+\b) |)(?\S+)(?> HTTP\/(?\d+\.\d+)|)" (?[+-]?\d+) (?>(?[+-]?\d+)|-) \ No newline at end of file diff --git a/reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-2.regex b/reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-2.regex new file mode 100644 index 0000000..58b1d8e --- /dev/null +++ b/reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-2.regex @@ -0,0 +1 @@ +(?s)(?(?:(?:((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?|(?\S+) (?\S+) \[(?[\d]{2}\Q/\E(?:[jJ][aA][nN]|[fF][eE][bB]|[mM][aA][rR]|[aA][pP][rR]|[mM][aA][yY]|[jJ][uU][nN]|[jJ][uU][lL]|[aA][uU][gG]|[sS][eE][pP]|[oO][cC][tT]|[nN][oO][vV]|[dD][eE][cC]|)\Q/\E[\d]{4,19}\Q:\E[\d]{2}\Q:\E[\d]{2}\Q:\E[\d]{2}\Q \E(?:Z|[+-]\d\d:?\d\d))\]\s+"(?>(?\b\w+\b) |)(?\S+)(?> HTTP\/(?\d+\.\d+)|)" (?[+-]?\d+) (?>(?[+-]?\d+)|-) "(?\S+)" "(?[^\"]*)" "(?[^\"]*)" "(?[^\"]*)" (?[+-]?(?>\d+(?:\.(?:\d*)?)?|\.\d+)) (?[+-]?(?>\d+(?:\.(?:\d*)?)?|\.\d+)).* \[(?\b\w+\b)\] .* \ No newline at end of file