Improve runtime compatibility, capture extraction, and token-sequence execution #68
SubsetConstructor now tracks the weakest anchor conjunction required to reach each NFA state during ε-closure. END/STRING_END_ABSOLUTE paths followed by a consumer are pruned at construction; START-class paths get a per-transition entry guard; accept conditions propagate into per-DFA-state acceptanceAnchorConditions. The two DFA codegens emit those checks per state and drop the legacy global hasEndAnchor gate at accept sites.
Fixes bare `$` matching at [0,0), `$X` behaving as `X$`, and the `$X|Y` branch poisoning that broke `$[^a-zA-Z0-9]|^[0-9]`.
NFA.requiresAnchorOnAllPaths no longer vacuously reports true for patterns with no char transitions, so the find loop reaches its empty-match-at-end handler for bare `$`.
Adds AnchorRegressionTest cross-checking against java.util.regex.Pattern and AnchorPlacementBenchmark covering the affected pattern shapes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
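For reference, the JDK behaviour that these anchor fixes are checked against can be reproduced with plain java.util.regex. The sketch below is not part of the PR; the class and helper names are made up for illustration, and only the JDK is exercised.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorOracleDemo {
    /** Returns the first find() span as "[start,end)", or "none" when nothing matches. */
    static String firstSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? "[" + m.start() + "," + m.end() + ")" : "none";
    }

    public static void main(String[] args) {
        // Bare $ on empty input: one empty match at the end of input.
        System.out.println(firstSpan("$", ""));
        // $ followed by a consumer is not the same as the consumer followed by $.
        System.out.println(firstSpan("$a", "a"));
        System.out.println(firstSpan("a$", "a"));
        // The ^[0-9] branch must stay reachable even though the other branch starts with $.
        System.out.println(firstSpan("$[^a-zA-Z0-9]|^[0-9]", "7up"));
    }
}
```

Any engine that claims JDK parity should report the same spans for these four calls.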
…guard
Two find()-path bugs in bounded quantifiers, both pre-existing on main:
1. STATELESS_LOOP's generateFindMatchFromMethod greedy-extended the
match end past the quantifier's upper bound. `[0-9]{5}` matched all
digits, `[0-9]{5,7}` matched up to the input length. Cap the matchEnd
scan at matchStart + maxReps; the matches()/find()/findBoundsFrom
variants already had this check.
2. SWARPatternAnalyzer returned a MultiRangeOptimization for any
multi-range CharSet, but MultiRangeOptimization only emits correct
bytecode for [a-zA-Z] and [a-zA-Z0-9]. Any other shape silently
falls back to scanning the first range only, so `[-_]?[0-9]{5,99}`
compiled to a SWAR loop searching for '-' alone and missed every
input that started with a digit or '_'. Gate MultiRangeOptimization
to the two supported shapes; other multi-range cases now use the
slower-but-correct charAt filter.
Adds BoundedQuantifierRegressionTest cross-checking against JDK Pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
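The upper-bound cap from item 1 can be sketched in plain Java. This is a hand-written stand-in for the generated find loop, not the actual bytecode; `findDigits` and `jdkFind` are hypothetical helpers, and the character class is hard-coded to `[0-9]`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedScanSketch {
    /**
     * Hypothetical helper: first match of [0-9]{min,max} at or after 'from'.
     * Returns {start, end}, or null when there is no match.
     */
    static int[] findDigits(String input, int from, int min, int max) {
        for (int start = from; start + min <= input.length(); start++) {
            // Cap the greedy scan at start + max so the match never exceeds the upper bound.
            int limit = (int) Math.min((long) input.length(), (long) start + max);
            int end = start;
            while (end < limit && input.charAt(end) >= '0' && input.charAt(end) <= '9') {
                end++;
            }
            if (end - start >= min) {
                return new int[] {start, end};
            }
        }
        return null;
    }

    /** The same search done by java.util.regex, used as the oracle. */
    static int[] jdkFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        int[] sketch = findDigits("ab123456789", 0, 5, 7);
        int[] jdk = jdkFind("[0-9]{5,7}", "ab123456789");
        System.out.println(sketch[0] + "," + sketch[1] + " vs " + jdk[0] + "," + jdk[1]);
    }
}
```

Without the `limit` cap the inner loop would run to the end of the digit run, which is the over-extension described above.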
Grammar-driven regex generator over a small alphabet (a/b/c/0/1/-/_), bounded depth, with the JDK as oracle. Each (pattern, input) is fed through Reggie.matches() and Reggie.findMatch() and compared to Pattern.matches() / Matcher.find(); divergences land in a Finding list. Patterns either engine rejects are skipped, not failed.
Smoke test runs 500 patterns × 8 inputs deterministically (seed 0xC0DEFEED_DEADBEEFL) in about 2 seconds. Findings are printed for triage; the test only fails on a runaway regression (> 25% finding rate).
The current default seed surfaces ~160 divergences from known pre-existing bugs in non-greedy quantification, quantified anchors, negated char-classes, and weird backref placements; these are seed material to triage and fix.
Plans for the fuzz-test framework and the related sub-2× perf candidates landed under doc/plans/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
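A minimal sketch of the oracle comparison described in this commit, assuming the engine under test can be wrapped in a `BiPredicate<String, String>`. That wrapper, the class name, and the use of `UnsupportedOperationException` for a rejected pattern are illustrative assumptions, not Reggie's API.

```java
import java.util.Optional;
import java.util.function.BiPredicate;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class FuzzOracleSketch {
    /**
     * Compares a candidate engine's full-match verdict with java.util.regex.
     * Returns a finding description on divergence, and empty when the engines
     * agree or when either side rejects the pattern.
     */
    static Optional<String> check(String regex, String input, BiPredicate<String, String> candidate) {
        boolean expected;
        boolean actual;
        try {
            expected = Pattern.compile(regex).matcher(input).matches();
            actual = candidate.test(regex, input);
        } catch (PatternSyntaxException | UnsupportedOperationException e) {
            return Optional.empty(); // rejected pattern: skipped, not failed
        }
        if (expected == actual) {
            return Optional.empty();
        }
        return Optional.of("/" + regex + "/ on \"" + input + "\": jdk=" + expected + " candidate=" + actual);
    }

    /** The smoke test fails only when the finding rate exceeds the ceiling, e.g. 0.25. */
    static boolean exceedsCeiling(int findings, int cases, double ceiling) {
        return cases > 0 && (double) findings / cases > ceiling;
    }
}
```

Keeping the verdict as data (an `Optional` finding) rather than an assertion is what lets the test print findings for triage and fail only on the ceiling.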
Single-char-deletion shrinker iterated to a fixpoint reduces each divergent (pattern, input) pair to its minimal still-failing form; the AlgorithmicFuzzTest dedupes shrunk findings before printing. On the default seed: 161 raw findings collapse to 64 unique minimal repros, most 4-6 chars long.
doc/plans/fuzz-findings-triage.md groups the minimal repros into six categories by likely root cause (lazy quantifiers, zero-width matches, negated char-class bound zero, self-referencing backrefs, quantified anchors, anchor placement) and recommends an execution order, starting with lazy quantifiers, which is the largest cluster and probably a single root cause.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
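The single-char-deletion idea can be sketched as follows. The `stillFails` predicate stands in for "the two engines still disagree on this pair"; the class and method names are hypothetical and the real RegexFuzzShrinker may be organised differently.

```java
import java.util.function.BiPredicate;

final class DeletionShrinkerSketch {
    /** Returns s with the character at index i removed. */
    static String without(String s, int i) {
        return s.substring(0, i) + s.substring(i + 1);
    }

    /** Shrinks (pattern, input) by single-character deletions until no deletion keeps the failure. */
    static String[] shrink(String pattern, String input, BiPredicate<String, String> stillFails) {
        boolean changed = true;
        while (changed) { // iterate to a fixpoint
            changed = false;
            for (int i = 0; i < pattern.length(); i++) {
                String candidate = without(pattern, i);
                if (stillFails.test(candidate, input)) {
                    pattern = candidate;
                    changed = true;
                    i--; // re-test the character that slid into position i
                }
            }
            for (int i = 0; i < input.length(); i++) {
                String candidate = without(input, i);
                if (stillFails.test(pattern, candidate)) {
                    input = candidate;
                    changed = true;
                    i--;
                }
            }
        }
        return new String[] {pattern, input};
    }
}
```

With a toy predicate such as `(p, s) -> p.contains("b") && s.contains("x")`, the pair `("abc", "xyz")` shrinks to `("b", "x")`. Deduplicating on the shrunk pair is what collapses many raw findings into a few minimal repros.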
ThompsonBuilder.buildCountedQuantifier built (max - min) optional
copies for {n,m} but only marked fragments[1..] as optional via the
i >= min check inside the chain loop. fragments[0] itself was always
required, so `c{0,3}` could match 1/2/3 c's but never 0, and any
pattern of the form `prefix X{0,N}` failed against a prefix-only
input — e.g. `[ab]c{0,3}` against "a", `[^c]c{0,3}` against "b".
Adds an explicit "0-reps bypass" by inserting the first fragment's
entry into allExits when min == 0. The whole counted-quantifier
fragment now exposes both its real exits and its entry, so traversal
with zero iterations is recognized end-to-end during DFA construction.
Regression test added to BoundedQuantifierRegressionTest. The fuzz
suite goes from 161 to 155 divergences on the default seed.
Cat A (lazy quantifiers) was investigated but defers to a follow-up;
three design options documented in doc/plans/fuzz-findings-triage.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
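A toy model of the zero-repetitions bypass, assuming a chain where state k means "k copies consumed" and the exit set decides where the fragment may be left. This is only an illustration of the idea; it does not mirror ThompsonBuilder's fragment types, and all names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class CountedQuantifierSketch {
    /** Exit states for c{min,max}; the bypass adds the entry state when min == 0. */
    static Set<Integer> exits(int min, int max, boolean zeroRepsBypass) {
        Set<Integer> exits = new HashSet<>();
        for (int k = 1; k <= max; k++) {
            if (k >= min) {
                exits.add(k); // may leave after the k-th copy
            }
        }
        if (zeroRepsBypass && min == 0) {
            exits.add(0); // the entry itself is an exit: zero iterations
        }
        return exits;
    }

    /** True when input[from..] is consumed entirely by the chain and ends in an exit state. */
    static boolean accepts(String input, int from, char c, int max, Set<Integer> exits) {
        int state = 0;
        for (int i = from; i < input.length(); i++) {
            if (input.charAt(i) != c || state == max) {
                return false;
            }
            state++;
        }
        return exits.contains(state);
    }
}
```

For `[ab]c{0,3}` against "a", the prefix consumes index 0 and the chain starts at index 1 with nothing left to read: without the bypass state 0 is not an exit and the match fails, with it the match succeeds, which agrees with `Pattern.matches("[ab]c{0,3}", "a")`.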
…branch requires \A
The position-skip optimization in DFAUnrolled / DFASwitch findFrom was reading `hasStringStartAnchor` (any \A anywhere in the pattern) in addition to `requiresStartAnchor`. For patterns like `]\A|b` where only one branch needs \A but the other can match anywhere, this made find() return -1 at every non-zero position, masking the always-valid branch entirely.
`requiresStartAnchor()` already treats both ^ and \A as barriers in the all-paths analysis, so it returns true only when every viable path requires one of them. Using just `requiresStartAnchor` is the sound condition. Drop the `hasStringStartAnchor` or-arm.
Also: stop the fuzz generator from emitting self-referencing backrefs (e.g. (\1\1)). JDK and Reggie disagree on these semantically-pathological shapes and the disagreement is documented as accepted divergence (Cat D in the triage doc).
Fuzz divergences drop from ~156 to ~135 on the default seed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
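The effect of an over-eager position skip can be reproduced with a hand-rolled find loop on top of java.util.regex. This is a sketch under assumptions: `findFrom` and its `skipNonZero` flag are hypothetical and only emulate the guard, and `]` is written escaped in the test pattern for clarity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StartAnchorSkipSketch {
    /**
     * Position-by-position find loop. When 'skipNonZero' is true the loop gives up
     * after position 0, emulating a start-anchor skip optimization.
     * Returns the match start, or -1.
     */
    static int findFrom(String regex, String input, boolean skipNonZero) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.useAnchoringBounds(false); // \A must mean the real start of input, not the region start
        for (int pos = 0; pos <= input.length(); pos++) {
            if (skipNonZero && pos > 0) {
                return -1;
            }
            m.region(pos, input.length());
            if (m.lookingAt()) {
                return pos;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Only one branch needs \A, so skipping non-zero positions hides the 'b' branch.
        System.out.println(findFrom("\\]\\A|b", "ab", false));
        System.out.println(findFrom("\\]\\A|b", "ab", true));
        // Every branch needs \A, so the skip changes nothing.
        System.out.println(findFrom("\\Ab|\\Ac", "ab", false));
        System.out.println(findFrom("\\Ab|\\Ac", "ab", true));
    }
}
```

The skip is only safe when the flag is derived from an all-paths property, which is the role `requiresStartAnchor` plays in the commit above.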
…ed min
- Cat A: lazyFindMode runtime flag; find() returns min match, matches() extends greedily
- Cat E: zero-width greedy loop keeps counting until min reached
…ck, NFA zero-width + backref semantics
- SWARPatternAnalyzer: disable LiteralSetOptimization for 2-4 chars; it only
searched literals[0], causing find() to miss positions for other chars
(e.g. _*0|... scanning for '-' only, missing '0')
- FallbackPatternDetector: detect cross-alternative backrefs (\N in alt-i
when group N is defined in alt-j≠i) and route to JDK for both
OPTIMIZED_NFA_WITH_BACKREFS and RECURSIVE_DESCENT; Thompson NFA shared
group state and RD backtracking both produce wrong results in this case
- NFABytecodeGenerator.generateFindMatchFromMethod: start matchEnd at
matchStart (not matchStart+1) so zero-width matches are tried; use
matchStart-1 as longestEnd sentinel; null-return check IF_ICMPGE not IF_ICMPNE
- RecursiveDescentBytecodeGenerator: unset group in backref returns -1
(JDK: fail) instead of pos (PCRE: match empty)
- LinearPatternAnalyzer.visitLiteral: skip epsilon LiteralNode('\0') that
the parser emits for empty group body (){n}
- NFA.contentHashCode: include state.backrefCheck so patterns differing
only in referenced group number don't share an L2 structural-cache entry
Fuzz findings: 18 → 4 (2 unique repros), well below 10% ceiling.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…y fallback doc
- SubsetConstructor: set anchorConditionDiluted when differing-anchor contributors
share a partition slice or accept site (transition guard / acceptance intersection
collapses to unconditional); DFA carries the flag via new constructor param
- PatternAnalyzer: check dfa.isAnchorConditionDiluted() and alternation-priority
conflict in both the plain-DFA and tagged-DFA paths; flag MatchingStrategyResult
so RuntimeCompiler routes to JavaRegexFallbackMatcher
- DFAUnrolledBytecodeGenerator: remove the STRING_END early-return from
generateStateCode(); matches() requires the full input to be consumed, so the
"before-final-newline" path is invalid there — the end-of-input handler already
accepts when pos == length
- StringAnchorsTest: correct assertions to match JDK semantics (abc\Z matches("abc\n")
is false; the trailing \n is not consumed by \Z in matches() mode)
- AnchorRegressionTest: add four Cat-E/F anchor-dilution regression cases and a new
testStringEndMatchesMode_doesNotConsumeTrailingNewline block that cross-checks
Reggie against JDK for \Z in matches()
- FallbackPatternDetector: add hasNullableBackrefGroup() — OPTIMIZED_NFA_WITH_BACKREFS
falls back when \N references a nullable group; shared group arrays record the greedy
(non-empty) capture, causing the zero-length backref path to use the wrong span
- FallbackPatternDetector: document why lazy quantifiers remain in JDK fallback —
RECURSIVE_DESCENT lacks general alternation backtracking; attempted removal exposed 36
distinct failures all rooted in the same (a|ab)-style commitment problem
- NFAFallbackPatterns: relax xmlTags() to greedy .* with a comment explaining the
original .*? falls back to java.util.regex and why
- ReggieMatcherBytecodeGeneratorTest: replace \d+? (now JDK fallback) with (\d+)\1{1,2}
which routes to RECURSIVE_DESCENT via hasQuantifiedBackrefs
- Fuzz ceiling tightened from 25% to 10% now that Cat-E/F findings are resolved
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
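The JDK semantics that the corrected StringAnchorsTest assertions rely on can be checked directly. The class and helper names below are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StringEndAnchorDemo {
    /** Whole-input match, as in Matcher.matches(). */
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** End offset of the first find() match, or -1. */
    static int findEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // \Z is zero-width: it may sit before a final line terminator but never consumes it.
        System.out.println(fullMatch("abc\\Z", "abc\n")); // matches() needs the whole input
        System.out.println(fullMatch("abc\\Z", "abc"));
        System.out.println(findEnd("abc\\Z", "abc\n"));   // find() stops before the newline
    }
}
```

In matches() mode the trailing newline is left unconsumed, so the "before-final-newline" position cannot be an accepting one; in find() mode it can.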
…ptionalGroupBackref
Six bug classes fixed, all confirmed by 0 findings in the 5 000-pattern fuzz sweep:
1. $ / \Z in extractQuantifierFromPattern: anchor-skip was too broad; END-type anchors
before a char consumer make patterns unmatchable at non-end positions.
extractQuantifierFromPattern now returns null for END/STRING_END, routing them to DFA/NFA
instead of SPECIALIZED_QUANTIFIED_GROUP which was unaware of the constraint.
FallbackPatternDetector adds a corresponding end-anchor-before-consumer rule for residual cases.
2. Negated CharClassNode ([^x]) in SPECIALIZED_QUANTIFIED_GROUP:
detectQuantifiedCapturingGroup was discarding CharClassNode.negated, so isNegatedCharSet()
always returned false for simple char-class groups. Generator then used the wrong negation
direction, matching only the excluded char instead of everything else. Fix: propagate
isNegatedCC to the full QuantifiedGroupInfo constructor and use info.isNegatedCharSet() in
the generator.
3. ({0}) zero-max quantifier: GreedyCharClassBytecodeGenerator accepted max==0 and produced
code that could never return a non-null match. detectGreedyCharClass now returns null for
max==0; the pattern falls through to a strategy (SPECIALIZED_CONCAT_GREEDY_GROUP) that
correctly emits an always-empty match.
4. SPECIALIZED_GREEDY_CHARCLASS findFrom for min=0: the char-scan loop skipped every
position where the first char wasn't in the class, missing the valid empty match that
min=0 (*) always yields at the scan start. generateFindFromMethod now returns start
immediately when minMatches==0.
5. Lazy quantifiers in OPTIMIZED_NFA_WITH_BACKREFS: findMatchFromMethod returns the LONGEST
match; lazy patterns need the SHORTEST. Extended the FallbackPatternDetector lazy rule to
also cover OPTIMIZED_NFA_WITH_BACKREFS, so b+?|()(\1) and similar route to JDK.
6. (X)?\1 OptionalGroupBackref with non-participating group: generator treated the "group
not matched" path as "backref satisfied" (vacuously matching empty), contrary to Java
semantics where \N to a non-participating group FAILS. FallbackPatternDetector now routes
OPTIONAL_GROUP_BACKREF patterns where the group content is non-nullable to JDK. Updated
OptionalGroupBackrefTest to match verified JDK behaviour.
Additional: containsOptionalQuantifier added to the groups-path alternation-priority
conflict check, catching DFA over-greed in patterns like .([a]?[0-b]{3})+ where the
optional [a]? inside a repeating group creates implicit alternation. The outer quantifier
fixed-count / unbounded-inner guard narrowed so ([ab]+)+ (unbounded outer) is not affected.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
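Several of the six classes above come down to specific JDK behaviours that are easy to confirm in isolation. The snippet below exercises only java.util.regex; the class name and the `firstSpan` helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JdkReferenceSemantics {
    /** Returns the first find() span as "[start,end)", or "none". */
    static String firstSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? "[" + m.start() + "," + m.end() + ")" : "none";
    }

    public static void main(String[] args) {
        // Class 4: min=0 always yields an empty match at the scan start.
        System.out.println(firstSpan("[a-c]*", "xyz"));
        // Class 5: a lazy quantifier takes the shortest match, a greedy one the longest.
        System.out.println(firstSpan("b+?", "bbb"));
        System.out.println(firstSpan("b+", "bbb"));
        // Class 6: a backreference to a non-participating group fails.
        System.out.println(Pattern.matches("(a)?\\1", ""));
        System.out.println(Pattern.matches("(a)?\\1", "aa"));
        // A nullable group that participates with an empty capture is different.
        System.out.println(Pattern.matches("(a*)\\1", ""));
    }
}
```

The last two cases show why the fallback rule distinguishes non-nullable from nullable group content: only a group that never participated makes `\1` fail.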
…lbacks
The containsOptionalQuantifier check introduced to fix .([a]?[0-b]{3})+ was
too broad: it also flagged (a*b*c*d*e*) and similar patterns where the DFA
result is correct, routing them to JavaRegexFallbackMatcher and cutting
benchmark throughput by ~200x for those patterns.
The divergence only occurs when an optional quantifier sits INSIDE a group
that is itself in a repeating quantifier (outer + or * or {n,m} with max>1).
Without the repeating outer loop, the DFA cannot accumulate extra chars via
ambiguous optional paths. Replace the broad walk with
hasOptionalInsideRepeatingGroup which only fires for the pattern:
QuantifierNode(max>1, child=GroupNode(...optional inside...))
- (a*b*c*d*e*): group not in a repeating quantifier → NOT flagged (181k ops/ms restored)
- .([a]?[0-b]{3})+: [a]? inside (...)+ → still flagged → JDK fallback ✓
Fuzz: 0 findings on 5 000-pattern sweep.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
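The narrowed check can be sketched over a tiny stand-in AST. The node classes here are hypothetical simplifications (the real parser nodes carry more information), and `Integer.MAX_VALUE` is assumed to represent an unbounded quantifier.

```java
import java.util.Arrays;
import java.util.List;

final class OptionalInRepeatSketch {
    interface Node {}

    /** Any single-character consumer (literal, class, dot). */
    static final class Atom implements Node {}

    static final class Group implements Node {
        final List<Node> children;
        Group(Node... children) { this.children = Arrays.asList(children); }
    }

    /** max == Integer.MAX_VALUE stands for an unbounded quantifier. */
    static final class Quant implements Node {
        final int min;
        final int max;
        final Node child;
        Quant(int min, int max, Node child) { this.min = min; this.max = max; this.child = child; }
    }

    /** True when some quantifier with min == 0 occurs anywhere under n. */
    static boolean containsOptional(Node n) {
        if (n instanceof Quant) {
            Quant q = (Quant) n;
            return q.min == 0 || containsOptional(q.child);
        }
        if (n instanceof Group) {
            for (Node c : ((Group) n).children) {
                if (containsOptional(c)) return true;
            }
        }
        return false;
    }

    /** Fires only for Quant(max > 1, child = Group(... optional inside ...)). */
    static boolean hasOptionalInsideRepeatingGroup(Node n) {
        if (n instanceof Quant) {
            Quant q = (Quant) n;
            if (q.max > 1 && q.child instanceof Group && containsOptional(q.child)) return true;
            return hasOptionalInsideRepeatingGroup(q.child);
        }
        if (n instanceof Group) {
            for (Node c : ((Group) n).children) {
                if (hasOptionalInsideRepeatingGroup(c)) return true;
            }
        }
        return false;
    }

    /** Shape of (a*b*c*d*e*): optional quantifiers, but the group is not repeated. */
    static Node flatStars() {
        int inf = Integer.MAX_VALUE;
        return new Group(new Quant(0, inf, new Atom()), new Quant(0, inf, new Atom()),
                new Quant(0, inf, new Atom()), new Quant(0, inf, new Atom()),
                new Quant(0, inf, new Atom()));
    }

    /** Shape of .([a]?[0-b]{3})+: an optional atom inside a repeated group. */
    static Node optionalInsidePlus() {
        Node inner = new Group(new Quant(0, 1, new Atom()), new Quant(3, 3, new Atom()));
        return new Group(new Atom(), new Quant(1, Integer.MAX_VALUE, inner));
    }
}
```

In this model `flatStars()` is not flagged while `optionalInsidePlus()` is, matching the two cases listed above.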
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4cb2184e8
Pull request overview
This PR significantly expands Reggie’s runtime and codegen capabilities to better handle large, capture-heavy (Grok-style) patterns with improved compatibility, lower-allocation capture extraction, and new deterministic “linear token sequence” execution routing.
Changes:
- Adds runtime API enhancements: `matchInto`/`findMatchInto`, `CapturePolicy` + `ReggieOptions`, and a public `UnsupportedPatternException` for precise unsupported-pattern handling.
- Improves parser compatibility and routing: atomic groups `(?>...)` (parsed as non-capturing), `\Q...\E` quoted literals (including inside char classes), plus expanded tests/resources for real Grok patterns.
- Introduces scalability and correctness work across DFA/NFA/codegen: DFA state-budget handling + DFA_TABLE backend, structural-hash hardening, multiple anchor/quantifier/backref correctness fixes, and new fuzzing infrastructure.
Reviewed changes
Copilot reviewed 89 out of 89 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-1.regex | Adds Grok-expanded regex fixture for regression/routing coverage. |
| reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-2.regex | Adds a second Grok-expanded regex fixture for regression/routing coverage. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/UnsupportedPatternExceptionTest.java | Verifies unsupported constructs throw the new public exception type. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/StringAnchorsTest.java | Updates expectations around matches() + \\Z / newline consumption. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/PCREParityDebugTest.java | Adjusts debug test semantics to use findMatch and clarifies intent. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/OptionalGroupBackrefTest.java | Aligns optional-group backref expectations to JDK semantics. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/MatchIntoAPITest.java | Adds tests for matchInto/findMatchInto behavior and overrides across strategies. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LookbehindVariantsTest.java | Adds regression coverage ensuring native DFA switch path handles IPv4 digit boundaries. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LogsBackendParserCompatibilityTest.java | Tests atomic group and \\Q...\\E parsing behavior relevant to logs backend patterns. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LinearTokenSequenceMatcherTest.java | Adds tests for the new structural linear-token-sequence matcher and capture extraction. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/FallbackVerificationTest.java | Updates fallback expectations for fixed lookbehind+lookahead sandwich behavior. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/DFAStateBudgetFallbackTest.java | Adds tests validating DFA_TABLE routing and compilation stability for large patterns. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/CapturePolicyTest.java | Verifies CapturePolicy.NAMED_ONLY indexing and capture dropping semantics. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/BoundedQuantifierRegressionTest.java | Regression tests for bounded quantifier upper-bound and multi-range SWAR filtering fixes. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java | Adds diagnostic coverage for $-anchor cases (currently implemented as stdout-only). |
| reggie-runtime/src/main/java/com/datadoghq/reggie/UnsupportedPatternException.java | Introduces a public exception type for unsupported-but-valid regex constructs. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/ReggieMatcher.java | Adds caller-owned capture-boundary APIs (matchInto/findMatchInto) and supporting reusable state. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/JavaRegexFallbackMatcher.java | Implements matchInto/findMatchInto for the JDK fallback path. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/HybridMatcher.java | Overrides matchInto to combine DFA pre-check + NFA capture extraction. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/ReggieOptions.java | Adds runtime compilation options container (currently capture-policy focused). |
| reggie-runtime/src/main/java/com/datadoghq/reggie/Reggie.java | Adds compile/cached overloads accepting ReggieOptions. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/CapturePolicy.java | Adds capture tracking policy enum including NAMED_ONLY. |
| reggie-processor/src/test/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGeneratorTest.java | Updates processor tests to reflect routing/semantics changes (recursive descent + optional group backrefs). |
| reggie-processor/src/test/java/com/datadoghq/reggie/processor/parsing/RegexParserTest.java | Adds parser tests for atomic groups and \\Q...\\E quoted literals. |
| reggie-processor/src/main/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGenerator.java | Wires matchInto generation and DFA_TABLE generator; adds recursive-descent state init. |
| reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/DollarAnchorCacheDiagTest.java | Adds integration diagnostics around $ anchor/cache interactions and fuzz-sweep reproduction. |
| reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/AlgorithmicFuzzTest.java | Adds oracle-based fuzz sweep against JDK regex with shrink/dedupe + ceiling guard. |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzShrinker.java | Adds shrinker that reduces divergent cases to minimal repros. |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzOracle.java | Adds JDK-vs-Reggie comparison oracle for matches/findMatch behavior. |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomRegexGenerator.java | Adds grammar-driven deterministic regex generator for fuzzing. |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomInputGenerator.java | Adds input generator aligned to fuzz regex alphabet (incl. newline). |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/FuzzRunner.java | Adds fuzz driver with configuration and finding caps. |
| reggie-integration-tests/build.gradle | Forwards -Dreggie.* JVM properties into test JVM for configurable fuzz size. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/StrategySelectionTest.java | Adds tests for DFA_TABLE selection and negated-class routing expectations. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/pbt/PatternRoutingPropertyBasedTest.java | Updates property-based routing expectations around large state-space behavior. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternRoutingPropertyTest.java | Updates DFA examples to reflect large-DFA fallback routing changes. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternCategorizerTest.java | Adds tests for new structural categorizer and token kind extraction. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlanTest.java | Adds tests for plan generation from categorization (including quoted capture folding). |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetectorTest.java | Updates bug-5 regression expectation: no blanket fallback for lookbehind+lookahead. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/parsing/RegexParser.java | Adds atomic-group parse support and \\Q...\\E quoted literal parsing (incl. char classes). |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/SWARPatternAnalyzer.java | Gates multi-range SWAR optimization to known-correct shapes to avoid missed matches. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/StatelessLoopBytecodeGenerator.java | Fixes bounded-quantifier upper bound handling and {0} edge cases in generated loops. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/QuantifiedGroupBytecodeGenerator.java | Fixes negated charset flag propagation for quantified groups. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OptionalGroupBackrefBytecodeGenerator.java | Aligns optional-group backref handling with Java semantics (non-participating group fails). |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OnePassBytecodeGenerator.java | Updates $ anchor checks to handle Java’s end/before-final-newline semantics. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/LinearPatternBytecodeGenerator.java | Mirrors $ anchor semantics update in linear bytecode path. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/GreedyCharClassBytecodeGenerator.java | Fixes min=0 empty-match behavior for greedy charclass find-from generation. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/BoundedQuantifierBytecodeGenerator.java | Fixes minimum repetition enforcement for bounded quantifiers when min > 1. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/ThompsonBuilder.java | Fixes counted-quantifier min==0 epsilon-bypass path in NFA construction. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/NFA.java | Hardens anchor reachability and structural hashing of anchor/assert/backref state features. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFATableData.java | Adds compact transition-table representation for large pure DFAs. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFA.java | Adds acceptance/entry anchor guards and an anchor-condition dilution signal. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/StructuralHash.java | Moves structural hash to 64-bit and hashes enum-driven data via ordinal/bitmask. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternCategorization.java | Adds categorization record for structural routing. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAtom.java | Adds semantic atom model used by categorizer and linear plan builder. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlan.java | Adds executable plan representation for token-sequence matcher execution. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearPatternAnalyzer.java | Ensures epsilon literal nodes (char 0) are treated as non-consuming. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/CaptureProjection.java | Adds AST rewrite to preserve named/semantic captures while dropping unobservable unnamed captures. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAMatchIntoBenchmark.java | Adds benchmark for matchInto-based capture boundary extraction. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAFallbackPatterns.java | Adjusts benchmark pattern to avoid lazy-quantifier fallback and still exercise recursive descent. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/LogsBackendGrokBenchmark.java | Adds Grok-like access-log benchmark for native token-sequence routing + matchInto. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/DFATableBenchmark.java | Adds performance benchmarks for large pure DFAs using DFA_TABLE backend. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/AnchorPlacementBenchmark.java | Adds benchmarks targeting anchor-placement correctness fixes and overhead. |
| doc/plans/sub-2x-perf-candidates.md | Adds performance triage notes for candidate slow benchmarks. |
| doc/plans/logs-backend.md | Adds adoption requirements and describes new structural token-sequence route. |
| doc/plans/issue-priority.md | Adds issue prioritization plan for correctness/features. |
| doc/plans/fuzz-findings-triage.md | Adds fuzz divergence triage notes and fix directions. |
| doc/plans/fuzz-findings-triage-EF-residual.md | Documents residual anchor-condition dilution issue and possible remedies. |
| doc/plans/algorithmic-fuzz-tests-vs-jdk.md | Adds design doc describing the oracle-based fuzz approach. |
| doc/libretti/2026-05-11-datadogjava-reggie36-pcre-lookahead-combined-with-nested-alternation-produces-wro.md | Adds spec/trace doc for implemented PCRE parity fix. |
| doc/libretti/2026-05-10-datadogjava-reggie30-bug-only-first-alternative-in-lookbehind-alternation-is-chec.md | Adds spec doc for lookbehind-alternation bug. |
| doc/libretti/2026-05-10-datadogjava-reggie29-bug-unbounded-quantifier-after-lookbehind-always-fails-to-ma.md | Adds spec doc for lookbehind+unbounded-quantifier bug. |
| doc/libretti/2026-05-09-datadogjava-reggie35-pcre-inline-m-flag-inside-a-group-doesnt-activate-multiline.md | Adds spec doc for inline multiline-flag scoping bug. |
| doc/libretti/2026-05-08-datadogjava-reggie27-bug-multiple-backreferences-to-same-group-produce-false-posi.md | Adds spec doc for multiple-backref correctness bug. |
| AGENTS.md | Adds a hard rule about updating StructuralHash when DFA/NFA/PatternInfo structures change. |
Summary
This PR improves Reggie's runtime compatibility and performance for large capture-heavy regexes, especially Grok-style patterns that expand into long regular expressions with many named groups.
The work spans parser compatibility, allocation-friendly capture extraction, capture projection, safer fallback behavior, DFA scalability improvements, and a new structural execution path for deterministic token-sequence patterns.
Major changes
Parser/API compatibility
- Supports atomic groups `(?>...)` by parsing them as non-capturing groups. For Reggie's non-backtracking engines, atomic-group backtracking-prevention semantics are equivalent to `(?:...)`.
- Supports `\Q...\E` quoted literals, including inside character classes.
- Adds `UnsupportedPatternException` for callers that need precise unsupported-pattern handling.
Allocation-friendly capture extraction
- Adds `matchInto(String input, int[] groupStarts, int[] groupEnds)` for caller-owned capture-boundary arrays.
Named capture projection
- Adds `CapturePolicy` and `ReggieOptions`.
- Adds `CapturePolicy.NAMED_ONLY` to preserve original named-group indexes while dropping unobservable unnamed captures from runtime capture tracking.
DFA/codegen scalability and fallback behavior
Structural token-sequence execution path
Adds a generic structural route:
This route recognizes reusable deterministic token atoms such as:
The route is structural: it does not depend on exact pattern strings or specific capture names.
Correctness and regression coverage
Performance validation
Representative integrated Grok-style benchmark with native token-sequence routing:
Coverage for the benchmarked patterns:
Validation
Passed.