Improve runtime compatibility, capture extraction, and token-sequence execution by jbachorik · Pull Request #68 · DataDog/java-reggie

jbachorik · 2026-05-29T12:21:52Z

Summary

This PR improves Reggie's runtime compatibility and performance for large capture-heavy regexes, especially Grok-style patterns that expand into long regular expressions with many named groups.

The work spans parser compatibility, allocation-friendly capture extraction, capture projection, safer fallback behavior, DFA scalability improvements, and a new structural execution path for deterministic token-sequence patterns.

Major changes

Parser/API compatibility

Supports atomic groups (?>...) by parsing them as non-capturing groups. For Reggie's non-backtracking engines, atomic-group backtracking-prevention semantics are equivalent to (?:...).
Supports \Q...\E quoted literals, including inside character classes.
Adds public UnsupportedPatternException for callers that need precise unsupported-pattern handling.
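
For reference, the constructs listed above behave as follows under java.util.regex. This is a minimal sketch that exercises only the JDK, not Reggie's parser; the class and helper names are illustrative.

```java
import java.util.regex.Pattern;

final class QuotedLiteralDemo {
    // True when the whole input matches the pattern under java.util.regex.
    static boolean jdkMatches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \Q...\E turns metacharacters into literals, so '.' only matches '.'.
        System.out.println(jdkMatches("\\Qa.b\\E", "a.b"));
        System.out.println(jdkMatches("\\Qa.b\\E", "axb"));
        // Quoting is also accepted inside a character class.
        System.out.println(jdkMatches("[\\Q.-\\E]+", ".-."));
        // An atomic group whose body has nothing to give back acts like (?:...).
        System.out.println(jdkMatches("(?>ab)c", "abc"));
    }
}
```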

Allocation-friendly capture extraction

Adds matchInto(String input, int[] groupStarts, int[] groupEnds) for caller-owned capture-boundary arrays.
Adds corresponding find/match-into support across runtime and generated paths.
Preserves the contract that caller arrays remain unchanged on no-match.
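
The caller-owned array contract can be sketched with a JDK-backed stand-in. The boolean return type, the extra Pattern parameter, and the use of index 0 for the whole match are assumptions made for illustration; this helper also allocates a Matcher, so it shows only the contract, not the allocation behavior.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchIntoSketch {
    // Hypothetical stand-in for matchInto(String, int[], int[]).
    static boolean matchInto(Pattern p, String input, int[] groupStarts, int[] groupEnds) {
        Matcher m = p.matcher(input);
        if (!m.matches()) {
            return false; // no-match: the caller's arrays are left untouched
        }
        int n = Math.min(m.groupCount() + 1, Math.min(groupStarts.length, groupEnds.length));
        for (int g = 0; g < n; g++) {
            groupStarts[g] = m.start(g); // -1 when the group did not participate
            groupEnds[g] = m.end(g);
        }
        return true;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d+)-(\\w+)");
        int[] starts = new int[3];
        int[] ends = new int[3];
        System.out.println(matchInto(p, "42-abc", starts, ends));
        System.out.println(Arrays.toString(starts) + " " + Arrays.toString(ends));
        // A failed match keeps the boundaries from the previous call.
        System.out.println(matchInto(p, "nope", starts, ends));
        System.out.println(Arrays.toString(starts) + " " + Arrays.toString(ends));
    }
}
```

Reusing the same two arrays across calls is what lets a caller extract capture boundaries in a loop without per-match result objects.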

Named capture projection

Adds CapturePolicy and ReggieOptions.
Adds CapturePolicy.NAMED_ONLY to preserve original named-group indexes while dropping unobservable unnamed captures from runtime capture tracking.
Keeps named-group lookup compatible with callers that discover original group indexes from the source pattern and later read by index.
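
The index-compatibility requirement can be illustrated from the caller's side. The scanner below is a simplified, hypothetical example of how a caller might discover named-group indexes from the pattern source (it ignores \Q...\E and nested character classes), cross-checked against the JDK. It does not use CapturePolicy or ReggieOptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedIndexSketch {
    // Maps each (?<name>...) group to its original capture index.
    static Map<String, Integer> namedGroupIndexes(String regex) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int index = 0;
        boolean inClass = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') { i++; continue; }            // skip the escaped char
            if (inClass) { if (c == ']') inClass = false; continue; }
            if (c == '[') { inClass = true; continue; }
            if (c != '(') continue;
            if (i + 1 < regex.length() && regex.charAt(i + 1) == '?') {
                // (?<name> is a named capture; (?<= and (?<! are lookbehinds.
                boolean named = regex.startsWith("(?<", i) && i + 3 < regex.length()
                        && regex.charAt(i + 3) != '=' && regex.charAt(i + 3) != '!';
                int close = named ? regex.indexOf('>', i + 3) : -1;
                if (close > 0) out.put(regex.substring(i + 3, close), ++index);
                continue;                                 // other (?...) forms do not capture
            }
            index++;                                      // plain unnamed capture
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "(\\d+) (?:x|y) (?<user>\\w+) (?<code>\\d{3})";
        Map<String, Integer> idx = namedGroupIndexes(regex);
        Matcher m = Pattern.compile(regex).matcher("7 x alice 200");
        if (m.matches()) {
            // Reading by the discovered index must agree with reading by name.
            System.out.println(m.group(idx.get("user")).equals(m.group("user")));
        }
    }
}
```

A projection that renumbered the surviving groups would break this kind of caller, which is why the original indexes are kept.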

DFA/codegen scalability and fallback behavior

Adds DFA state-budget handling for oversized state spaces.
Adds a table-driven DFA backend.
Improves generated-method-too-large diagnostics with class/method/descriptor/code-size details.
Fixes lookaround/fallback behavior needed by large expanded patterns.
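
As a general illustration of what a table-driven DFA looks like, here is a minimal hand-built matcher for [0-9]+(\.[0-9]+)?. The table layout, character classes, and names are assumptions for the sketch and are not taken from Reggie's DFATableData.

```java
final class TableDfaSketch {
    static final int DEAD = -1;

    // Rows are states; columns are character classes: 0 = digit, 1 = '.', 2 = other.
    static final int[][] NEXT = {
        {1, DEAD, DEAD}, // 0: start
        {1, 2, DEAD},    // 1: integer digits (accepting)
        {3, DEAD, DEAD}, // 2: just consumed '.'
        {3, DEAD, DEAD}, // 3: fraction digits (accepting)
    };
    static final boolean[] ACCEPT = {false, true, false, true};

    static int classOf(char c) {
        if (c >= '0' && c <= '9') return 0;
        return c == '.' ? 1 : 2;
    }

    // One table lookup per input character; code size is independent of state count.
    static boolean matches(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = NEXT[state][classOf(input.charAt(i))];
            if (state == DEAD) return false;
        }
        return ACCEPT[state];
    }

    public static void main(String[] args) {
        System.out.println(matches("12.5"));
        System.out.println(matches("12."));
    }
}
```

Because the transitions live in data rather than in generated branches, a table of this shape sidesteps per-method bytecode size limits that unrolled code can hit for large automata.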

Structural token-sequence execution path

Adds a generic structural route:

regex AST -> PatternCategorizer -> LinearTokenSequencePlan -> LinearTokenSequenceMatcher

This route recognizes reusable deterministic token atoms such as:

literals and whitespace
non-space fields
IP/host-like fields
quoted fields
delimiter captures
signed integers and decimals
optional token subsequences
trailing bounded bracketed-word captures

The route is structural: it does not depend on exact pattern strings or specific capture names.
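
To make the idea concrete, the following is a heavily simplified sketch of a deterministic token-sequence matcher. The Kind enum, Token class, and matchInto helper are hypothetical and do not mirror LinearTokenSequencePlan or LinearTokenSequenceMatcher. It assumes every field is followed by a delimiter it cannot contain, which is what makes a single left-to-right pass sufficient.

```java
import java.util.List;

final class TokenSequenceSketch {
    enum Kind { LITERAL, NON_SPACE, QUOTED, SIGNED_INT }

    static final class Token {
        final Kind kind;
        final String literal; // only used by LITERAL
        final int group;      // capture index, or -1 when not captured
        Token(Kind kind, String literal, int group) {
            this.kind = kind; this.literal = literal; this.group = group;
        }
    }

    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    // Full match; on success fills the arrays (index 0 = whole match), else leaves them alone.
    static boolean matchInto(List<Token> plan, String in, int[] starts, int[] ends) {
        int[] s = starts.clone(); // scratch copies so a late failure writes nothing
        int[] e = ends.clone();
        int pos = 0;
        for (Token t : plan) {
            int capStart = pos;
            int capEnd;
            switch (t.kind) {
                case LITERAL:
                    if (!in.startsWith(t.literal, pos)) return false;
                    pos += t.literal.length();
                    capEnd = pos;
                    break;
                case NON_SPACE: // behaves like [^ ]+
                    while (pos < in.length() && in.charAt(pos) != ' ') pos++;
                    if (pos == capStart) return false;
                    capEnd = pos;
                    break;
                case QUOTED: // behaves like "([^"]*)", capturing the inside
                    if (pos >= in.length() || in.charAt(pos) != '"') return false;
                    int close = in.indexOf('"', pos + 1);
                    if (close < 0) return false;
                    capStart = pos + 1;
                    capEnd = close;
                    pos = close + 1;
                    break;
                default: // SIGNED_INT, behaves like -?[0-9]+
                    if (pos < in.length() && in.charAt(pos) == '-') pos++;
                    int digitsFrom = pos;
                    while (pos < in.length() && isDigit(in.charAt(pos))) pos++;
                    if (pos == digitsFrom) return false;
                    capEnd = pos;
                    break;
            }
            if (t.group >= 0) { s[t.group] = capStart; e[t.group] = capEnd; }
        }
        if (pos != in.length()) return false;
        s[0] = 0;
        e[0] = pos;
        System.arraycopy(s, 0, starts, 0, s.length);
        System.arraycopy(e, 0, ends, 0, e.length);
        return true;
    }
}
```

A plan of NON_SPACE, LITERAL " ", QUOTED, LITERAL " ", SIGNED_INT corresponds to the regex ([^ ]+) "([^"]*)" (-?[0-9]+), and the sketch never needs to back up because no token can consume its following delimiter.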

Correctness and regression coverage

Adds parser compatibility tests.
Adds capture-policy and match-into API tests.
Adds structural categorizer and linear-token-sequence planner/runtime tests.
Adds real expanded Grok-style regex fixtures as regression tests.
Adds JDK/Reggie named-capture boundary equivalence tests for representative large capture-heavy patterns, including optional fields and delimiter/quoted-field edge cases.
Adds/extends routing/fallback tests and algorithmic fuzzing infrastructure.

Performance validation

Representative integrated Grok-style benchmark with native token-sequence routing:

Engine	Score	Allocation
JDK regex	16.210 ± 2.128 us/op	7701.393 ± 154.805 B/op
Reggie native token sequence	2.353 ± 0.161 us/op	7682.979 ± 61.845 B/op

Coverage for the benchmarked patterns:

2/2 native, 0/2 internal JDK fallback, 0/2 supplier JDK fallback

Validation

./gradlew spotlessApply build

Passed.

SubsetConstructor now tracks the weakest anchor conjunction required to reach each NFA state during ε-closure. END/STRING_END_ABSOLUTE paths followed by a consumer are pruned at construction; START-class paths get a per-transition entry guard; accept conditions propagate into per-DFA-state acceptanceAnchorConditions. The two DFA codegens emit those checks per state and drop the legacy global hasEndAnchor gate at accept sites.

Fixes bare $ matching at [0,0), $X behaving as X$, and the $X|Y branch poisoning that broke $[^a-zA-Z0-9]|^[0-9].

NFA.requiresAnchorOnAllPaths no longer vacuously reports true for patterns with no char transitions, so the find loop reaches its empty-match-at-end handler for bare $.

Adds AnchorRegressionTest cross-checking against java.util.regex.Pattern and AnchorPlacementBenchmark covering the affected pattern shapes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
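
The JDK behavior that these anchor fixes are cross-checked against can be reproduced with a small oracle helper. This is a sketch using only java.util.regex; the helper name is illustrative and it is not the project's AnchorRegressionTest.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorOracleSketch {
    // Returns {start, end} of the first JDK match, or null when find() fails.
    static int[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        // Bare $ on non-empty input matches only at the end, not at [0,0).
        System.out.println(Arrays.toString(firstMatch("$", "ab")));
        // $ followed by a consumer is not the same as the consumer followed by $.
        System.out.println(Arrays.toString(firstMatch("$a", "a")));
        System.out.println(Arrays.toString(firstMatch("a$", "a")));
        // The ^[0-9] branch must stay reachable despite the $ in the other branch.
        System.out.println(Arrays.toString(firstMatch("$[^a-zA-Z0-9]|^[0-9]", "7up")));
    }
}
```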

…guard

Two find()-path bugs in bounded quantifiers, both pre-existing on main:

1. STATELESS_LOOP's generateFindMatchFromMethod greedy-extended the match end past the quantifier's upper bound. [0-9]{5} matched all digits, [0-9]{5,7} matched up to the input length. Cap the matchEnd scan at matchStart + maxReps; the matches()/find()/findBoundsFrom variants already had this check.
2. SWARPatternAnalyzer returned a MultiRangeOptimization for any multi-range CharSet, but MultiRangeOptimization only emits correct bytecode for [a-zA-Z] and [a-zA-Z0-9]. Any other shape silently falls back to scanning the first range only, so [-_]?[0-9]{5,99} compiled to a SWAR loop searching for '-' alone and missed every input that started with a digit or '_'. Gate MultiRangeOptimization to the two supported shapes; other multi-range cases now use the slower-but-correct charAt filter.

Adds BoundedQuantifierRegressionTest cross-checking against JDK Pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
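
The expected find() bounds for bounded quantifiers can be checked directly against the JDK. The patterns below are illustrative digit-class shapes chosen for this sketch, which uses only java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedQuantifierOracle {
    // End offset of the first JDK match, or -1 when nothing matches.
    static int firstMatchEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // The match must stop at the upper bound even when more digits follow.
        System.out.println(firstMatchEnd("[0-9]{5}", "1234567"));
        System.out.println(firstMatchEnd("[0-9]{5,7}", "123456789"));
        // An optional leading [-_] must not hide inputs that start with a digit or '_'.
        System.out.println(firstMatchEnd("[-_]?[0-9]{5,99}", "12345"));
        System.out.println(firstMatchEnd("[-_]?[0-9]{5,99}", "_12345"));
    }
}
```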

Grammar-driven regex generator over a small alphabet (a/b/c/0/1/-/_), bounded depth, with the JDK as oracle. Each (pattern, input) is fed through Reggie.matches() and Reggie.findMatch() and compared to Pattern.matches() / Matcher.find(); divergences land in a Finding list. Patterns either engine rejects are skipped, not failed.

Smoke test runs 500 patterns × 8 inputs deterministically (seed 0xC0DEFEED_DEADBEEFL), taking about 2 seconds. Findings are printed for triage; the test only fails on a runaway regression (> 25% finding rate).

The current default seed surfaces ~160 divergences from known pre-existing bugs in non-greedy quantification, quantified anchors, negated char-classes, and weird backref placements: seed material to triage and fix.

Plans for the fuzz-test framework and the related sub-2× perf candidates landed under doc/plans/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
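
The overall shape of such a fuzz loop can be sketched as follows. This is not the project's RandomRegexGenerator or FuzzRunner: the grammar, depth, and the BiPredicate standing in for the engine under test are assumptions made for illustration, and only full-match results are compared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.BiPredicate;
import java.util.regex.Pattern;

final class FuzzSketch {
    static final String ALPHABET = "abc01-_";

    static String atom(Random r) {
        char c = ALPHABET.charAt(r.nextInt(ALPHABET.length()));
        return c == '-' ? "\\-" : String.valueOf(c); // keep '-' literal inside classes
    }

    // Bounded-depth grammar: atoms, concatenation, alternation, quantified groups, classes.
    static String gen(Random r, int depth) {
        int choice = depth <= 0 ? 0 : r.nextInt(5);
        switch (choice) {
            case 1: return gen(r, depth - 1) + gen(r, depth - 1);
            case 2: return "(" + gen(r, depth - 1) + "|" + gen(r, depth - 1) + ")";
            case 3: return "(" + gen(r, depth - 1) + ")" + "*+?".charAt(r.nextInt(3));
            case 4: return "[" + atom(r) + atom(r) + "]";
            default: return atom(r);
        }
    }

    static String input(Random r) {
        StringBuilder sb = new StringBuilder();
        int len = r.nextInt(6);
        for (int i = 0; i < len; i++) sb.append(ALPHABET.charAt(r.nextInt(ALPHABET.length())));
        return sb.toString();
    }

    // Same seed, same patterns and inputs; the JDK result is the oracle.
    static List<String> run(long seed, int patterns, int inputsPerPattern,
                            BiPredicate<String, String> engineUnderTest) {
        Random r = new Random(seed);
        List<String> findings = new ArrayList<>();
        for (int p = 0; p < patterns; p++) {
            String regex = gen(r, 3);
            for (int i = 0; i < inputsPerPattern; i++) {
                String in = input(r);
                boolean expected;
                boolean actual;
                try {
                    expected = Pattern.matches(regex, in);
                    actual = engineUnderTest.test(regex, in);
                } catch (RuntimeException rejected) {
                    continue; // a pattern either side rejects is skipped, not failed
                }
                if (expected != actual) findings.add(regex + " vs \"" + in + "\"");
            }
        }
        return findings;
    }
}
```

Seeding the generator is what makes a finding reproducible: rerunning with the same seed regenerates the same (pattern, input) pairs for triage.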

Single-char-deletion shrinker iterated to a fixpoint reduces each divergent (pattern, input) pair to its minimal still-failing form; the AlgorithmicFuzzTest dedupes shrunk findings before printing. On the default seed: 161 raw findings collapse to 64 unique minimal repros, most 4-6 chars long.

doc/plans/fuzz-findings-triage.md groups the minimal repros into six categories by likely root cause (lazy quantifiers, zero-width matches, negated char-class bound zero, self-referencing backrefs, quantified anchors, anchor placement) and recommends an execution order, starting with lazy quantifiers, which is the largest cluster and probably a single root cause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
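
Single-character-deletion shrinking to a fixpoint is a small algorithm, sketched below. This is not the project's RegexFuzzShrinker; the stillFails predicate is a hypothetical stand-in for "the two engines still disagree on this pair".

```java
import java.util.function.BiPredicate;
import java.util.function.Predicate;

final class ShrinkSketch {
    // Repeatedly deletes one character for as long as the failure persists.
    static String shrinkOne(String s, Predicate<String> stillFails) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < s.length(); i++) {
                String candidate = s.substring(0, i) + s.substring(i + 1);
                if (stillFails.test(candidate)) {
                    s = candidate;
                    changed = true;
                    break; // rescan the shorter string from the start
                }
            }
        }
        return s;
    }

    // Alternates pattern and input shrinking until neither gets shorter.
    static String[] shrink(String pattern, String input, BiPredicate<String, String> stillFails) {
        String p = pattern;
        String in = input;
        while (true) {
            final String fixedIn = in;
            String np = shrinkOne(p, c -> stillFails.test(c, fixedIn));
            final String fixedP = np;
            String nin = shrinkOne(in, c -> stillFails.test(fixedP, c));
            if (np.equals(p) && nin.equals(in)) return new String[] {p, in};
            p = np;
            in = nin;
        }
    }

    public static void main(String[] args) {
        // Toy divergence: "fails" whenever the pattern contains a? and the input contains b.
        String[] min = shrink("x(a?)y", "abba", (p, in) -> p.contains("a?") && in.contains("b"));
        System.out.println(min[0] + " / " + min[1]);
    }
}
```

In a real harness the predicate should return false for candidates that no longer compile, so that invalid intermediate patterns are never accepted as smaller repros.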

ThompsonBuilder.buildCountedQuantifier built (max - min) optional copies for {n,m} but only marked fragments[1..] as optional via the i >= min check inside the chain loop. fragments[0] itself was always required, so `c{0,3}` could match 1/2/3 c's but never 0, and any pattern of the form `prefix X{0,N}` failed against a prefix-only input, e.g. `[ab]c{0,3}` against "a", `[^c]c{0,3}` against "b".

Adds an explicit "0-reps bypass" by inserting the first fragment's entry into allExits when min == 0. The whole counted-quantifier fragment now exposes both its real exits and its entry, so traversal with zero iterations is recognized end-to-end during DFA construction.

Regression test added to BoundedQuantifierRegressionTest. The fuzz suite goes from 161 to 155 divergences on the default seed.

Cat A (lazy quantifiers) was investigated but defers to a follow-up; three design options documented in doc/plans/fuzz-findings-triage.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…branch requires \A

The position-skip optimization in DFAUnrolled / DFASwitch findFrom was reading `hasStringStartAnchor` (any \A anywhere in the pattern) in addition to `requiresStartAnchor`. For patterns like `]\A|b` where only one branch needs \A but the other can match anywhere, this made find() return -1 at every non-zero position, masking the always-valid branch entirely.

`requiresStartAnchor()` already treats both ^ and \A as barriers in the all-paths analysis, so it returns true only when every viable path requires one of them. Using just `requiresStartAnchor` is the sound condition. Drop the `hasStringStartAnchor` or-arm.

Also: stop the fuzz generator from emitting self-referencing backrefs (e.g. (\1\1)). JDK and Reggie disagree on these semantically-pathological shapes and the disagreement is documented as accepted divergence (Cat D in the triage doc).

Fuzz divergences drop from ~156 to ~135 on the default seed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed min

- Cat A: lazyFindMode runtime flag: find() returns min match, matches() extends greedily
- Cat E: zero-width greedy loop keeps counting until min reached

…ck, NFA zero-width + backref semantics

- SWARPatternAnalyzer: disable LiteralSetOptimization for 2-4 chars; it only searched literals[0], causing find() to miss positions for other chars (e.g. _*0|... scanning for '-' only, missing '0')
- FallbackPatternDetector: detect cross-alternative backrefs (\N in alt-i when group N is defined in alt-j≠i) and route to JDK for both OPTIMIZED_NFA_WITH_BACKREFS and RECURSIVE_DESCENT; Thompson NFA shared group state and RD backtracking both produce wrong results in this case
- NFABytecodeGenerator.generateFindMatchFromMethod: start matchEnd at matchStart (not matchStart+1) so zero-width matches are tried; use matchStart-1 as longestEnd sentinel; null-return check IF_ICMPGE not ICMPNE
- RecursiveDescentBytecodeGenerator: unset group in backref returns -1 (JDK: fail) instead of pos (PCRE: match empty)
- LinearPatternAnalyzer.visitLiteral: skip epsilon LiteralNode('\0') that the parser emits for empty group body (){n}
- NFA.contentHashCode: include state.backrefCheck so patterns differing only in referenced group number don't share an L2 structural-cache entry

Fuzz findings: 18 → 4 (2 unique repros), well below 10% ceiling.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…y fallback doc

- SubsetConstructor: set anchorConditionDiluted when differing-anchor contributors share a partition slice or accept site (transition guard / acceptance intersection collapses to unconditional); DFA carries the flag via new constructor param
- PatternAnalyzer: check dfa.isAnchorConditionDiluted() and alternation-priority conflict in both the plain-DFA and tagged-DFA paths; flag MatchingStrategyResult so RuntimeCompiler routes to JavaRegexFallbackMatcher
- DFAUnrolledBytecodeGenerator: remove the STRING_END early-return from generateStateCode(); matches() requires the full input to be consumed, so the "before-final-newline" path is invalid there; the end-of-input handler already accepts when pos == length
- StringAnchorsTest: correct assertions to match JDK semantics (abc\Z matches("abc\n") is false; the trailing \n is not consumed by \Z in matches() mode)
- AnchorRegressionTest: add four Cat-E/F anchor-dilution regression cases and a new testStringEndMatchesMode_doesNotConsumeTrailingNewline block that cross-checks Reggie against JDK for \Z in matches()
- FallbackPatternDetector: add hasNullableBackrefGroup(): OPTIMIZED_NFA_WITH_BACKREFS falls back when \N references a nullable group; shared group arrays record the greedy (non-empty) capture, causing the zero-length backref path to use the wrong span
- FallbackPatternDetector: document why lazy quantifiers remain in JDK fallback: RECURSIVE_DESCENT lacks general alternation backtracking; attempted removal exposed 36 distinct failures all rooted in the same (a|ab)-style commitment problem
- NFAFallbackPatterns: relax xmlTags() to greedy .* with a comment explaining the original .*? falls back to java.util.regex and why
- ReggieMatcherBytecodeGeneratorTest: replace \d+? (now JDK fallback) with (\d+)\1{1,2} which routes to RECURSIVE_DESCENT via hasQuantifiedBackrefs
- Fuzz ceiling tightened from 25% to 10% now that Cat-E/F findings are resolved

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
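
The \Z behavior that the corrected StringAnchorsTest assertions rely on can be observed with the JDK alone. A minimal sketch, with an illustrative class name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StringEndOracle {
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // End offset of the first find() match, or -1 when there is none.
    static int findEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // \Z may sit before a final line terminator, but it does not consume it,
        // so matches() (which must cover the whole input) rejects "abc\n".
        System.out.println(fullMatch("abc\\Z", "abc\n"));
        System.out.println(fullMatch("abc\\Z", "abc"));
        // find() is satisfied with the prefix and stops before the newline.
        System.out.println(findEnd("abc\\Z", "abc\n"));
    }
}
```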

…ptionalGroupBackref

Six bug classes fixed, all confirmed by 0 findings in the 5 000-pattern fuzz sweep:

1. $ / \Z in extractQuantifierFromPattern: anchor-skip was too broad; END-type anchors before a char consumer make patterns unmatchable at non-end positions. extractQuantifierFromPattern now returns null for END/STRING_END, routing them to DFA/NFA instead of SPECIALIZED_QUANTIFIED_GROUP which was unaware of the constraint. FallbackPatternDetector adds a corresponding end-anchor-before-consumer rule for residual cases.
2. Negated CharClassNode ([^x]) in SPECIALIZED_QUANTIFIED_GROUP: detectQuantifiedCapturingGroup was discarding CharClassNode.negated, so isNegatedCharSet() always returned false for simple char-class groups. Generator then used the wrong negation direction, matching only the excluded char instead of everything else. Fix: propagate isNegatedCC to the full QuantifiedGroupInfo constructor and use info.isNegatedCharSet() in the generator.
3. ({0}) zero-max quantifier: GreedyCharClassBytecodeGenerator accepted max==0 and produced code that could never return a non-null match. detectGreedyCharClass now returns null for max==0; the pattern falls through to a strategy (SPECIALIZED_CONCAT_GREEDY_GROUP) that correctly emits an always-empty match.
4. SPECIALIZED_GREEDY_CHARCLASS findFrom for min=0: the char-scan loop skipped every position where the first char wasn't in the class, missing the valid empty match that min=0 (*) always yields at the scan start. generateFindFromMethod now returns start immediately when minMatches==0.
5. Lazy quantifiers in OPTIMIZED_NFA_WITH_BACKREFS: findMatchFromMethod returns the LONGEST match; lazy patterns need the SHORTEST. Extended the FallbackPatternDetector lazy rule to also cover OPTIMIZED_NFA_WITH_BACKREFS, so b+?|()(\1) and similar route to JDK.
6. (X)?\1 OptionalGroupBackref with non-participating group: generator treated the "group not matched" path as "backref satisfied" (vacuously matching empty), contrary to Java semantics where \N to a non-participating group FAILS. FallbackPatternDetector now routes OPTIONAL_GROUP_BACKREF patterns where the group content is non-nullable to JDK. Updated OptionalGroupBackrefTest to match verified JDK behaviour.

Additional: containsOptionalQuantifier added to the groups-path alternation-priority conflict check, catching DFA over-greed in patterns like .([a]?[0-b]{3})+ where the optional [a]? inside a repeating group creates implicit alternation. The outer quantifier fixed-count / unbounded-inner guard narrowed so ([ab]+)+ (unbounded outer) is not affected.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
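
Two of the JDK behaviors referenced above (the empty match for a min=0 class quantifier, and a backreference to a non-participating group failing) can be confirmed with a short java.util.regex sketch. The class and helper names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackrefOracle {
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Start offset of the first find() match, or -1 when there is none.
    static int findStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // (a)? is skipped on "", so \1 refers to a non-participating group and fails.
        System.out.println(fullMatch("(a)?\\1", ""));
        // When the group participates, the backreference repeats its text.
        System.out.println(fullMatch("(a)?\\1", "aa"));
        // A min=0 quantifier always yields an empty match at the scan start.
        System.out.println(findStart("[a-z]*", "123"));
    }
}
```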

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…lbacks

The containsOptionalQuantifier check introduced to fix .([a]?[0-b]{3})+ was too broad: it also flagged (a*b*c*d*e*) and similar patterns where the DFA result is correct, routing them to JavaRegexFallbackMatcher and cutting benchmark throughput by ~200x for those patterns.

The divergence only occurs when an optional quantifier sits INSIDE a group that is itself in a repeating quantifier (outer + or * or {n,m} with max>1). Without the repeating outer loop, the DFA cannot accumulate extra chars via ambiguous optional paths.

Replace the broad walk with hasOptionalInsideRepeatingGroup which only fires for the pattern: QuantifierNode(max>1, child=GroupNode(...optional inside...))

- (a*b*c*d*e*): group not in a repeating quantifier → NOT flagged (181k ops/ms restored)
- .([a]?[0-b]{3})+: [a]? inside (...)+ → still flagged → JDK fallback ✓

Fuzz: 0 findings on 5 000-pattern sweep.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
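
The narrowed check can be sketched over a toy AST. The Node, Group, and Quant classes below are hypothetical and much simpler than Reggie's QuantifierNode/GroupNode; treating any quantifier with min == 0 as "optional" and an unbounded max as repeating are assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.List;

final class OptionalInRepeatSketch {
    static final int UNBOUNDED = -1;

    // A node with no children stands for a leaf such as a literal or char class.
    static class Node {
        final List<Node> children;
        Node(Node... children) { this.children = Arrays.asList(children); }
    }

    static final class Group extends Node {
        Group(Node... children) { super(children); }
    }

    static final class Quant extends Node {
        final int min;
        final int max;
        Quant(int min, int max, Node child) { super(child); this.min = min; this.max = max; }
    }

    // True when any quantifier in the subtree allows zero repetitions.
    static boolean containsOptional(Node n) {
        if (n instanceof Quant && ((Quant) n).min == 0) return true;
        for (Node c : n.children) if (containsOptional(c)) return true;
        return false;
    }

    // Fires only for Quant(max > 1 or unbounded) directly wrapping a Group with an optional inside.
    static boolean hasOptionalInsideRepeatingGroup(Node n) {
        if (n instanceof Quant) {
            Quant q = (Quant) n;
            boolean repeats = q.max == UNBOUNDED || q.max > 1;
            Node child = q.children.get(0);
            if (repeats && child instanceof Group && containsOptional(child)) return true;
        }
        for (Node c : n.children) if (hasOptionalInsideRepeatingGroup(c)) return true;
        return false;
    }
}
```

With this shape, a group of starred atoms that is not itself repeated is left alone, while the same optional content under an outer + or * is flagged.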

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4cb2184e8

Copilot

Pull request overview

This PR significantly expands Reggie’s runtime and codegen capabilities to better handle large, capture-heavy (Grok-style) patterns with improved compatibility, lower-allocation capture extraction, and new deterministic “linear token sequence” execution routing.

Changes:

Adds runtime API enhancements: matchInto / findMatchInto, CapturePolicy + ReggieOptions, and a public UnsupportedPatternException for precise unsupported-pattern handling.
Improves parser compatibility and routing: atomic groups (?>...) (parsed as non-capturing), \Q...\E quoted literals (including inside char classes), plus expanded tests/resources for real Grok patterns.
Introduces scalability and correctness work across DFA/NFA/codegen: DFA state-budget handling + DFA_TABLE backend, structural-hash hardening, multiple anchor/quantifier/backref correctness fixes, and new fuzzing infrastructure.

Reviewed changes

Copilot reviewed 89 out of 89 changed files in this pull request and generated 3 comments.

File	Description
reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-1.regex	Adds Grok-expanded regex fixture for regression/routing coverage.
reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-2.regex	Adds a second Grok-expanded regex fixture for regression/routing coverage.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/UnsupportedPatternExceptionTest.java	Verifies unsupported constructs throw the new public exception type.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/StringAnchorsTest.java	Updates expectations around `matches()` + `\Z` / newline consumption.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/PCREParityDebugTest.java	Adjusts debug test semantics to use `findMatch` and clarifies intent.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/OptionalGroupBackrefTest.java	Aligns optional-group backref expectations to JDK semantics.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/MatchIntoAPITest.java	Adds tests for `matchInto`/`findMatchInto` behavior and overrides across strategies.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LookbehindVariantsTest.java	Adds regression coverage ensuring native DFA switch path handles IPv4 digit boundaries.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LogsBackendParserCompatibilityTest.java	Tests atomic group and `\Q...\E` parsing behavior relevant to logs backend patterns.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LinearTokenSequenceMatcherTest.java	Adds tests for the new structural linear-token-sequence matcher and capture extraction.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/FallbackVerificationTest.java	Updates fallback expectations for fixed lookbehind+lookahead sandwich behavior.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/DFAStateBudgetFallbackTest.java	Adds tests validating DFA_TABLE routing and compilation stability for large patterns.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/CapturePolicyTest.java	Verifies `CapturePolicy.NAMED_ONLY` indexing and capture dropping semantics.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/BoundedQuantifierRegressionTest.java	Regression tests for bounded quantifier upper-bound and multi-range SWAR filtering fixes.
reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java	Adds diagnostic coverage for `$`-anchor cases (currently implemented as stdout-only).
reggie-runtime/src/main/java/com/datadoghq/reggie/UnsupportedPatternException.java	Introduces a public exception type for unsupported-but-valid regex constructs.
reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/ReggieMatcher.java	Adds caller-owned capture-boundary APIs (`matchInto`/`findMatchInto`) and supporting reusable state.
reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/JavaRegexFallbackMatcher.java	Implements `matchInto`/`findMatchInto` for the JDK fallback path.
reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/HybridMatcher.java	Overrides `matchInto` to combine DFA pre-check + NFA capture extraction.
reggie-runtime/src/main/java/com/datadoghq/reggie/ReggieOptions.java	Adds runtime compilation options container (currently capture-policy focused).
reggie-runtime/src/main/java/com/datadoghq/reggie/Reggie.java	Adds `compile`/`cached` overloads accepting `ReggieOptions`.
reggie-runtime/src/main/java/com/datadoghq/reggie/CapturePolicy.java	Adds capture tracking policy enum including `NAMED_ONLY`.
reggie-processor/src/test/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGeneratorTest.java	Updates processor tests to reflect routing/semantics changes (recursive descent + optional group backrefs).
reggie-processor/src/test/java/com/datadoghq/reggie/processor/parsing/RegexParserTest.java	Adds parser tests for atomic groups and `\Q...\E` quoted literals.
reggie-processor/src/main/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGenerator.java	Wires matchInto generation and DFA_TABLE generator; adds recursive-descent state init.
reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/DollarAnchorCacheDiagTest.java	Adds integration diagnostics around `$` anchor/cache interactions and fuzz-sweep reproduction.
reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/AlgorithmicFuzzTest.java	Adds oracle-based fuzz sweep against JDK regex with shrink/dedupe + ceiling guard.
reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzShrinker.java	Adds shrinker that reduces divergent cases to minimal repros.
reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzOracle.java	Adds JDK-vs-Reggie comparison oracle for matches/findMatch behavior.
reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomRegexGenerator.java	Adds grammar-driven deterministic regex generator for fuzzing.
reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomInputGenerator.java	Adds input generator aligned to fuzz regex alphabet (incl. newline).
reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/FuzzRunner.java	Adds fuzz driver with configuration and finding caps.
reggie-integration-tests/build.gradle	Forwards `-Dreggie.*` JVM properties into test JVM for configurable fuzz size.
reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/StrategySelectionTest.java	Adds tests for DFA_TABLE selection and negated-class routing expectations.
reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/pbt/PatternRoutingPropertyBasedTest.java	Updates property-based routing expectations around large state-space behavior.
reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternRoutingPropertyTest.java	Updates DFA examples to reflect large-DFA fallback routing changes.
reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternCategorizerTest.java	Adds tests for new structural categorizer and token kind extraction.
reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlanTest.java	Adds tests for plan generation from categorization (including quoted capture folding).
reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetectorTest.java	Updates bug-5 regression expectation: no blanket fallback for lookbehind+lookahead.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/parsing/RegexParser.java	Adds atomic-group parse support and `\Q...\E` quoted literal parsing (incl. char classes).
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/SWARPatternAnalyzer.java	Gates multi-range SWAR optimization to known-correct shapes to avoid missed matches.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/StatelessLoopBytecodeGenerator.java	Fixes bounded-quantifier upper bound handling and `{0}` edge cases in generated loops.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/QuantifiedGroupBytecodeGenerator.java	Fixes negated charset flag propagation for quantified groups.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OptionalGroupBackrefBytecodeGenerator.java	Aligns optional-group backref handling with Java semantics (non-participating group fails).
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OnePassBytecodeGenerator.java	Updates `$` anchor checks to handle Java’s end/before-final-newline semantics.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/LinearPatternBytecodeGenerator.java	Mirrors `$` anchor semantics update in linear bytecode path.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/GreedyCharClassBytecodeGenerator.java	Fixes `min=0` empty-match behavior for greedy charclass find-from generation.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/BoundedQuantifierBytecodeGenerator.java	Fixes minimum repetition enforcement for bounded quantifiers when `min > 1`.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/ThompsonBuilder.java	Fixes counted-quantifier `min==0` epsilon-bypass path in NFA construction.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/NFA.java	Hardens anchor reachability and structural hashing of anchor/assert/backref state features.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFATableData.java	Adds compact transition-table representation for large pure DFAs.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFA.java	Adds acceptance/entry anchor guards and an anchor-condition dilution signal.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/StructuralHash.java	Moves structural hash to 64-bit and hashes enum-driven data via ordinal/bitmask.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternCategorization.java	Adds categorization record for structural routing.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAtom.java	Adds semantic atom model used by categorizer and linear plan builder.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlan.java	Adds executable plan representation for token-sequence matcher execution.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearPatternAnalyzer.java	Ensures epsilon literal nodes (char 0) are treated as non-consuming.
reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/CaptureProjection.java	Adds AST rewrite to preserve named/semantic captures while dropping unobservable unnamed captures.
reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAMatchIntoBenchmark.java	Adds benchmark for matchInto-based capture boundary extraction.
reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAFallbackPatterns.java	Adjusts benchmark pattern to avoid lazy-quantifier fallback and still exercise recursive descent.
reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/LogsBackendGrokBenchmark.java	Adds Grok-like access-log benchmark for native token-sequence routing + matchInto.
reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/DFATableBenchmark.java	Adds performance benchmarks for large pure DFAs using DFA_TABLE backend.
reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/AnchorPlacementBenchmark.java	Adds benchmarks targeting anchor-placement correctness fixes and overhead.
doc/plans/sub-2x-perf-candidates.md	Adds performance triage notes for candidate slow benchmarks.
doc/plans/logs-backend.md	Adds adoption requirements and describes new structural token-sequence route.
doc/plans/issue-priority.md	Adds issue prioritization plan for correctness/features.
doc/plans/fuzz-findings-triage.md	Adds fuzz divergence triage notes and fix directions.
doc/plans/fuzz-findings-triage-EF-residual.md	Documents residual anchor-condition dilution issue and possible remedies.
doc/plans/algorithmic-fuzz-tests-vs-jdk.md	Adds design doc describing the oracle-based fuzz approach.
doc/libretti/2026-05-11-datadogjava-reggie36-pcre-lookahead-combined-with-nested-alternation-produces-wro.md	Adds spec/trace doc for implemented PCRE parity fix.
doc/libretti/2026-05-10-datadogjava-reggie30-bug-only-first-alternative-in-lookbehind-alternation-is-chec.md	Adds spec doc for lookbehind-alternation bug.
doc/libretti/2026-05-10-datadogjava-reggie29-bug-unbounded-quantifier-after-lookbehind-always-fails-to-ma.md	Adds spec doc for lookbehind+unbounded-quantifier bug.
doc/libretti/2026-05-09-datadogjava-reggie35-pcre-inline-m-flag-inside-a-group-doesnt-activate-multiline.md	Adds spec doc for inline multiline-flag scoping bug.
doc/libretti/2026-05-08-datadogjava-reggie27-bug-multiple-backreferences-to-same-group-produce-false-posi.md	Adds spec doc for multiple-backref correctness bug.
AGENTS.md	Adds a hard rule about updating StructuralHash when DFA/NFA/PatternInfo structures change.

jbachorik and others added 30 commits May 26, 2026 22:52

fix: lazy quantifiers respect find-vs-matches mode + zero-width count…

218d487

…ed min

- Cat A: lazyFindMode runtime flag: find() returns min match, matches() extends greedily
- Cat E: zero-width greedy loop keeps counting until min reached

docs: fuzz triage, issue priority list, libretti, and anchor diag test

fc8110d

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

feat: support atomic groups and quoted literals

800fbba

feat: add matchInto capture boundary API

51101c6

feat: fall back for oversized DFA state spaces

c662831

bench: add logs backend grok benchmark

6de21bd

feat: add allocation-free NFA matchInto

1765624

feat: add recursive descent matchInto

91ea690

fix: avoid recursive descent for delimited negated captures

259ccf3

feat: add table-driven DFA backend

2e6d423

perf: skip impossible DFA table scan starts

f1d7584

feat: complete P2 parser compatibility

1265585

fix: support combined lookaround assertions

a0bcf08

chore: include method details for oversized bytecode fallback

f9b1923

feat: add capture projection options

32a7d25

feat: specialize access log grok matching

4151297

feat: add structural pattern categorizer

4351e5c

feat: classify reusable log pattern atoms

5ea1e7d

feat: add linear template planning

90da424

feat: add linear template runtime matcher

092fd5a

jbachorik added 8 commits May 29, 2026 12:56

feat: route named linear templates

fd90e24

feat: handle access log templates generically

31eec11

refactor: remove access log oracle routing

254c604

perf: reuse linear template optional scratch

a470d08

refactor: remove access log matcher oracle

6b9db9d

test: assert real grok patterns use linear templates

af8f55f

refactor: rename linear templates to token sequences

6082430

test: harden log token sequence equivalence

a4cb218

chatgpt-codex-connector Bot reviewed May 29, 2026

jbachorik changed the title ~~Adopt native token-sequence route for logs-backend Grok patterns~~ Add native token-sequence route for structured log patterns May 29, 2026

jbachorik changed the title ~~Add native token-sequence route for structured log patterns~~ Add native linear token-sequence regex execution path May 29, 2026

jbachorik changed the title ~~Add native linear token-sequence regex execution path~~ Improve runtime compatibility, capture extraction, and token-sequence execution May 29, 2026

jbachorik requested a review from Copilot May 29, 2026 12:52

Copilot started reviewing on behalf of jbachorik May 29, 2026 12:52

Copilot AI reviewed May 29, 2026

Comment thread AGENTS.md

Comment thread reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java Outdated

Comment thread reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java Outdated

jbachorik added 2 commits May 29, 2026 14:57

fix: harden token sequence capture semantics

b98aef6

test: assert anchor diagnostics

a0873c1

jbachorik merged commit 5f5706c into main May 29, 2026
8 checks passed

jbachorik deleted the jb/logs-backend branch May 29, 2026 13:18

jbachorik added this to the 0.3.0 milestone May 29, 2026

jbachorik merged 40 commits into main from jb/logs-backend