
Improve runtime compatibility, capture extraction, and token-sequence execution #68

Merged
jbachorik merged 40 commits into main from jb/logs-backend on May 29, 2026

@jbachorik jbachorik commented May 29, 2026

Summary

This PR improves Reggie's runtime compatibility and performance for large capture-heavy regexes, especially Grok-style patterns that expand into long regular expressions with many named groups.

The work spans parser compatibility, allocation-friendly capture extraction, capture projection, safer fallback behavior, DFA scalability improvements, and a new structural execution path for deterministic token-sequence patterns.

Major changes

Parser/API compatibility

  • Supports atomic groups (?>...) by parsing them as non-capturing groups. For Reggie's non-backtracking engines, atomic-group backtracking-prevention semantics are equivalent to (?:...).
  • Supports \Q...\E quoted literals, including inside character classes.
  • Adds public UnsupportedPatternException for callers that need precise unsupported-pattern handling.
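For reference, the two syntax forms look like this when run through `java.util.regex.Pattern`. This is a sketch of the JDK's behaviour only, not of Reggie's parser; the class name `ParserCompatDemo` is illustrative. The atomic-group comparison is deliberately limited to inputs where the greedy choice inside the group never has to be revisited.

```java
import java.util.regex.Pattern;

final class ParserCompatDemo {
    // \Q...\E quotes everything in between, so '.' is a literal dot here.
    static boolean quotedLiteral(String input) {
        return Pattern.matches("\\Qa.b\\E", input);
    }

    // Inside a character class, \Q.-\E contributes the literal characters '.' and '-'.
    static boolean quotedInClass(String input) {
        return Pattern.matches("[\\Q.-\\E]+", input);
    }

    // Atomic group: digits followed by a literal dot.
    static boolean atomic(String input) {
        return Pattern.matches("(?>\\d+)\\.", input);
    }

    // The same pattern written with a plain non-capturing group.
    static boolean nonCapturing(String input) {
        return Pattern.matches("(?:\\d+)\\.", input);
    }
}
```

On inputs such as `123.` and `123`, the atomic and non-capturing forms return the same result.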

Allocation-friendly capture extraction

  • Adds matchInto(String input, int[] groupStarts, int[] groupEnds) for caller-owned capture-boundary arrays.
  • Adds corresponding find/match-into support across runtime and generated paths.
  • Preserves the contract that caller arrays remain unchanged on no-match.
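The contract can be illustrated with a stand-in built on `java.util.regex`. This is only a sketch: the helper below is hypothetical, its `boolean` return type is an assumption (the description above gives only the parameter list), and unlike a native implementation it still allocates a JDK `Matcher` per call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchIntoSketch {
    /**
     * Hypothetical stand-in for a matchInto-style call. On a full match it writes
     * the start/end offset of each group (index 0 is the whole match) into the
     * caller-owned arrays and returns true. On no match it returns false and
     * leaves both arrays untouched.
     */
    static boolean matchInto(Pattern p, String input, int[] groupStarts, int[] groupEnds) {
        Matcher m = p.matcher(input);
        if (!m.matches()) {
            return false; // nothing has been written yet
        }
        int n = Math.min(m.groupCount() + 1, Math.min(groupStarts.length, groupEnds.length));
        for (int g = 0; g < n; g++) {
            groupStarts[g] = m.start(g); // -1 when the group did not participate
            groupEnds[g] = m.end(g);
        }
        return true;
    }
}
```

A caller would allocate the two arrays once and reuse them across inputs, reading the boundaries only when the call returns true.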

Named capture projection

  • Adds CapturePolicy and ReggieOptions.
  • Adds CapturePolicy.NAMED_ONLY to preserve original named-group indexes while dropping unobservable unnamed captures from runtime capture tracking.
  • Keeps named-group lookup compatible with callers that discover original group indexes from the source pattern and later read by index.
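To make "original group indexes" concrete, the sketch below shows how a caller might discover them from the source pattern. It is a simplified, hypothetical scanner (it skips escapes and character classes but ignores `\Q...\E` and comments mode) and is not Reggie code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NamedGroupIndexSketch {
    /** Maps each named group to its 1-based index among ALL capturing groups. */
    static Map<String, Integer> namedGroupIndexes(String pattern) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int index = 0;
        boolean inClass = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\') { i++; continue; }                 // skip the escaped character
            if (inClass) { if (c == ']') inClass = false; continue; }
            if (c == '[') { inClass = true; continue; }
            if (c != '(') continue;
            boolean special = i + 1 < pattern.length() && pattern.charAt(i + 1) == '?';
            if (!special) { index++; continue; }              // plain unnamed capturing group
            // (?<name>...) captures; (?<=...) and (?<!...) are lookbehinds and do not.
            if (i + 3 < pattern.length() && pattern.charAt(i + 2) == '<'
                    && Character.isLetter(pattern.charAt(i + 3))) {
                int close = pattern.indexOf('>', i + 3);
                if (close > 0) out.put(pattern.substring(i + 3, close), ++index);
            }
        }
        return out;
    }
}
```

For `(\d+)-(?<user>\w+)(?:\s+)(?<code>\d+)` this yields `user` at index 2 and `code` at index 3. A caller that later reads group 2 by index expects that slot to stay valid even when the unnamed group 1 is no longer tracked.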

DFA/codegen scalability and fallback behavior

  • Adds DFA state-budget handling for oversized state spaces.
  • Adds a table-driven DFA backend.
  • Improves generated-method-too-large diagnostics with class/method/descriptor/code-size details.
  • Fixes lookaround/fallback behavior needed by large expanded patterns.
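The general idea behind a table-driven backend can be sketched in a few lines. The example below is a hand-written toy for `[0-9]+(\.[0-9]+)?`, not Reggie's generated output; all names are hypothetical. The point is that the matching loop is a fixed size regardless of how many states the table holds, whereas per-state generated code grows with the automaton and can run into the JVM's per-method bytecode size limit.

```java
final class TableDfaSketch {
    static final int DEAD = -1;

    // Character classes: 0 = digit, 1 = '.', 2 = anything else.
    static int classOf(char c) {
        if (c >= '0' && c <= '9') return 0;
        return c == '.' ? 1 : 2;
    }

    // Rows are states, columns are character classes.
    static final int[][] NEXT = {
        /* 0: start          */ {1, DEAD, DEAD},
        /* 1: integer part   */ {1, 2, DEAD},
        /* 2: just saw '.'   */ {3, DEAD, DEAD},
        /* 3: fraction part  */ {3, DEAD, DEAD},
    };
    static final boolean[] ACCEPT = {false, true, false, true};

    static boolean matches(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = NEXT[state][classOf(input.charAt(i))];
            if (state == DEAD) return false; // no transition: reject early
        }
        return ACCEPT[state];
    }
}
```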

Structural token-sequence execution path

Adds a generic structural route:

regex AST -> PatternCategorizer -> LinearTokenSequencePlan -> LinearTokenSequenceMatcher

This route recognizes reusable deterministic token atoms such as:

  • literals and whitespace
  • non-space fields
  • IP/host-like fields
  • quoted fields
  • delimiter captures
  • signed integers and decimals
  • optional token subsequences
  • trailing bounded bracketed-word captures

The route is structural: it does not depend on exact pattern strings or specific capture names.
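A heavily simplified sketch of what executing such a plan could look like follows. The token kinds, the `Token` class, and the `match` signature are all hypothetical and cover only a few of the atoms listed above; the actual `LinearTokenSequencePlan` and `LinearTokenSequenceMatcher` are not reproduced here. Each token scans forward from the current position exactly once, so no backtracking is needed.

```java
import java.util.Arrays;
import java.util.List;

final class TokenSequenceSketch {
    enum Kind { LITERAL, NON_SPACE, INTEGER, QUOTED }

    static final class Token {
        final Kind kind;
        final String literal; // used by LITERAL only
        final int group;      // capture slot, or -1 when the token is not captured
        Token(Kind kind, String literal, int group) {
            this.kind = kind; this.literal = literal; this.group = group;
        }
    }

    /** Full-input match; fills starts/ends only when the whole plan succeeds. */
    static boolean match(List<Token> plan, String in, int[] starts, int[] ends) {
        int[] s = new int[starts.length];
        int[] e = new int[ends.length];
        Arrays.fill(s, -1);
        Arrays.fill(e, -1);
        int pos = 0;
        for (Token t : plan) {
            int end = scan(t, in, pos);
            if (end < 0) return false;
            if (t.group >= 0) { s[t.group] = pos; e[t.group] = end; }
            pos = end;
        }
        if (pos != in.length()) return false;
        System.arraycopy(s, 0, starts, 0, s.length);
        System.arraycopy(e, 0, ends, 0, e.length);
        return true;
    }

    /** Returns the end offset of the token starting at pos, or -1 if it does not match. */
    static int scan(Token t, String in, int pos) {
        int i = pos;
        switch (t.kind) {
            case LITERAL:
                return in.startsWith(t.literal, pos) ? pos + t.literal.length() : -1;
            case NON_SPACE:
                while (i < in.length() && !Character.isWhitespace(in.charAt(i))) i++;
                return i > pos ? i : -1;
            case INTEGER: {
                if (i < in.length() && (in.charAt(i) == '-' || in.charAt(i) == '+')) i++;
                int firstDigit = i;
                while (i < in.length() && in.charAt(i) >= '0' && in.charAt(i) <= '9') i++;
                return i > firstDigit ? i : -1;
            }
            case QUOTED: {
                if (pos >= in.length() || in.charAt(pos) != '"') return -1;
                int close = in.indexOf('"', pos + 1);
                return close < 0 ? -1 : close + 1;
            }
            default:
                return -1;
        }
    }
}
```

A plan of `NON_SPACE, " ", QUOTED, " ", INTEGER` applied to `10.0.0.1 "GET /" 200` captures the spans `[0,8)`, `[9,16)`, and `[17,20)`. The scratch arrays exist only to keep the sketch short while still leaving caller arrays unchanged on failure.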

Correctness and regression coverage

  • Adds parser compatibility tests.
  • Adds capture-policy and match-into API tests.
  • Adds structural categorizer and linear-token-sequence planner/runtime tests.
  • Adds real expanded Grok-style regex fixtures as regression tests.
  • Adds JDK/Reggie named-capture boundary equivalence tests for representative large capture-heavy patterns, including optional fields and delimiter/quoted-field edge cases.
  • Adds/extends routing/fallback tests and algorithmic fuzzing infrastructure.

Performance validation

Representative integrated Grok-style benchmark with native token-sequence routing:

| Engine | Score | Allocation |
|---|---|---|
| JDK regex | 16.210 ± 2.128 us/op | 7701.393 ± 154.805 B/op |
| Reggie native token sequence | 2.353 ± 0.161 us/op | 7682.979 ± 61.845 B/op |

Coverage for the benchmarked patterns:

2/2 native, 0/2 internal JDK fallback, 0/2 supplier JDK fallback

Validation

./gradlew spotlessApply build

Passed.

jbachorik and others added 30 commits May 26, 2026 22:52
SubsetConstructor now tracks the weakest anchor conjunction required
to reach each NFA state during ε-closure. END/STRING_END_ABSOLUTE
paths followed by a consumer are pruned at construction; START-class
paths get a per-transition entry guard; accept conditions propagate
into per-DFA-state acceptanceAnchorConditions. The two DFA codegens
emit those checks per state and drop the legacy global hasEndAnchor
gate at accept sites. Fixes bare \$ matching at [0,0), \$X behaving
as X\$, and the \$X|Y branch poisoning that broke \$[^a-zA-Z0-9]|^[0-9].
NFA.requiresAnchorOnAllPaths no longer vacuously reports true for
patterns with no char transitions, so the find loop reaches its
empty-match-at-end handler for bare \$.

Adds AnchorRegressionTest cross-checking against java.util.regex.Pattern
and AnchorPlacementBenchmark covering the affected pattern shapes.
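The JDK behaviour that such a cross-check compares against can be reproduced directly. The sketch below assumes the `\$` in the message above denotes the `$` end anchor; the helper class is illustrative and is not the actual AnchorRegressionTest.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorOracleDemo {
    /** Returns {start, end} of the first JDK find() hit, or null when nothing matches. */
    static int[] firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }
}
```

With the JDK as oracle, a bare `$` finds the empty match `[0,0)` in the empty string, `$a` never matches `a` while `a$` does, and `$[^a-zA-Z0-9]|^[0-9]` still matches `7x` through its second branch.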

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…guard

Two find()-path bugs in bounded quantifiers, both pre-existing on main:

1. STATELESS_LOOP's generateFindMatchFromMethod greedy-extended the
   match end past the quantifier's upper bound. [0-9]{5} matched all
   digits, [0-9]{5,7} matched up to the input length. Cap the matchEnd
   scan at matchStart + maxReps; the matches()/find()/findBoundsFrom
   variants already had this check.

2. SWARPatternAnalyzer returned a MultiRangeOptimization for any
   multi-range CharSet, but MultiRangeOptimization only emits correct
   bytecode for [a-zA-Z] and [a-zA-Z0-9]. Any other shape silently
   falls back to scanning the first range only, so {[-_]?[0-9]{5,99}}
   compiled to a SWAR loop searching for '-' alone and missed every
   input that started with a digit or '_'. Gate MultiRangeOptimization
   to the two supported shapes; other multi-range cases now use the
   slower-but-correct charAt filter.

Adds BoundedQuantifierRegressionTest cross-checking against JDK Pattern.
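The expected values for the two bug shapes can be read off the JDK directly. The helper below is an illustrative sketch, not the regression test itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedQuantifierOracleDemo {
    /** Returns {start, end} of the first JDK find() hit, or null when nothing matches. */
    static int[] firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }
}
```

The JDK stops `[0-9]{5}` after five digits and `[0-9]{5,7}` after seven, and it finds `[-_]?[0-9]{5,99}` in inputs that begin with a digit or an underscore.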

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Grammar-driven regex generator over a small alphabet (a/b/c/0/1/-/_),
bounded depth, with the JDK as oracle. Each (pattern, input) is fed
through Reggie.matches() and Reggie.findMatch() and compared to
Pattern.matches() / Matcher.find(); divergences land in a Finding
list. Patterns either engine rejects are skipped, not failed.

Smoke test runs 500 patterns × 8 inputs deterministically (seed
0xC0DEFEED_DEADBEEFL) — about 2 seconds. Findings are printed for
triage; the test only fails on a runaway regression (> 25% finding
rate). The current default seed surfaces ~160 divergences from
known pre-existing bugs in non-greedy quantification, quantified
anchors, negated char-classes, and weird backref placements —
seed material to triage and fix.
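The overall shape of such a differential fuzz loop can be sketched as follows. The `Engine` interface, the grammar, and all names are hypothetical simplifications and do not reflect the actual RandomRegexGenerator or RegexFuzzOracle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;

final class DifferentialFuzzSketch {
    /** Engine under test; a hypothetical interface, not Reggie's API. */
    interface Engine {
        boolean matches(String regex, String input);
    }

    static final String ALPHABET = "abc01-_";

    /** Grammar-driven generator with bounded depth. */
    static String gen(Random r, int depth) {
        if (depth == 0) {
            return String.valueOf(ALPHABET.charAt(r.nextInt(ALPHABET.length())));
        }
        switch (r.nextInt(4)) {
            case 0: return gen(r, depth - 1) + gen(r, depth - 1);                    // concatenation
            case 1: return "(" + gen(r, depth - 1) + "|" + gen(r, depth - 1) + ")"; // alternation
            case 2: return "(" + gen(r, depth - 1) + ")*";                           // repetition
            default: return gen(r, 0);                                               // single literal
        }
    }

    static String input(Random r, int maxLen) {
        StringBuilder sb = new StringBuilder();
        int len = r.nextInt(maxLen + 1);
        for (int i = 0; i < len; i++) sb.append(ALPHABET.charAt(r.nextInt(ALPHABET.length())));
        return sb.toString();
    }

    /** Runs the sweep deterministically for a given seed and returns the divergences. */
    static List<String> run(Engine candidate, long seed, int patterns, int inputsPerPattern) {
        Random r = new Random(seed);
        List<String> findings = new ArrayList<>();
        for (int p = 0; p < patterns; p++) {
            String regex = gen(r, 3);
            for (int i = 0; i < inputsPerPattern; i++) {
                String in = input(r, 6);
                boolean expected;
                boolean actual;
                try {
                    expected = Pattern.matches(regex, in); // the JDK is the oracle
                    actual = candidate.matches(regex, in);
                } catch (RuntimeException rejected) {
                    continue; // a pattern either engine rejects is skipped, not failed
                }
                if (expected != actual) findings.add(regex + " / " + in);
            }
        }
        return findings;
    }
}
```

Fixing the seed makes the sweep reproducible, so a finding count can be tracked from commit to commit.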

Plans for the fuzz-test framework and the related sub-2× perf
candidates landed under doc/plans/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-char-deletion shrinker iterated to a fixpoint reduces each
divergent (pattern, input) pair to its minimal still-failing form;
the AlgorithmicFuzzTest dedupes shrunk findings before printing.
On the default seed: 161 raw findings collapse to 64 unique minimal
repros, most 4-6 chars long.
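Single-character-deletion shrinking to a fixpoint is a generic technique; a minimal sketch over one string is shown below. The class and method names are hypothetical. For a (pattern, input) pair the same loop would be applied to each component in turn, with a predicate that treats an uncompilable pattern as no longer failing.

```java
import java.util.function.Predicate;

final class DeletionShrinkerSketch {
    /** Repeatedly deletes single characters while the failure predicate still holds. */
    static String shrink(String s, Predicate<String> stillFails) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < s.length(); i++) {
                String candidate = s.substring(0, i) + s.substring(i + 1);
                if (stillFails.test(candidate)) {
                    s = candidate;
                    changed = true;
                    i--; // the next character has moved into slot i
                }
            }
        }
        return s;
    }
}
```

For example, shrinking `xxabyy` under the predicate "still contains `ab`" yields `ab`.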

doc/plans/fuzz-findings-triage.md groups the minimal repros into
six categories by likely root cause (lazy quantifiers, zero-width
matches, negated char-class bound zero, self-referencing backrefs,
quantified anchors, anchor placement) and recommends an execution
order — starting with lazy quantifiers, which is the largest cluster
and probably a single root cause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ThompsonBuilder.buildCountedQuantifier built (max - min) optional
copies for {n,m} but only marked fragments[1..] as optional via the
i >= min check inside the chain loop. fragments[0] itself was always
required, so `c{0,3}` could match 1/2/3 c's but never 0, and any
pattern of the form `prefix X{0,N}` failed against a prefix-only
input — e.g. `[ab]c{0,3}` against "a", `[^c]c{0,3}` against "b".

Adds an explicit "0-reps bypass" by inserting the first fragment's
entry into allExits when min == 0. The whole counted-quantifier
fragment now exposes both its real exits and its entry, so traversal
with zero iterations is recognized end-to-end during DFA construction.

Regression test added to BoundedQuantifierRegressionTest. The fuzz
suite goes from 161 to 155 divergences on the default seed.

Cat A (lazy quantifiers) was investigated but defers to a follow-up;
three design options documented in doc/plans/fuzz-findings-triage.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…branch requires \A

The position-skip optimization in DFAUnrolled / DFASwitch findFrom
was reading `hasStringStartAnchor` (any \A anywhere in the pattern)
in addition to `requiresStartAnchor`. For patterns like `]\A|b`
where only one branch needs \A but the other can match anywhere,
this made find() return -1 at every non-zero position — masking
the always-valid branch entirely.

`requiresStartAnchor()` already treats both ^ and \A as barriers in
the all-paths analysis, so it returns true only when every viable
path requires one of them. Using just `requiresStartAnchor` is the
sound condition. Drop the `hasStringStartAnchor` or-arm.
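The JDK result for the example pattern shows why skipping non-zero positions is only sound when every branch needs the anchor. The helper below is an illustrative sketch using `java.util.regex`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StartAnchorBranchDemo {
    /** Returns the start offset of the first JDK find() hit, or -1 when nothing matches. */
    static int findStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }
}
```

For `]\A|b` on `ab` the JDK reports a match at offset 1 via the `b` branch, while `\Ab` on the same input reports none because its only path requires position 0.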

Also: stop the fuzz generator from emitting self-referencing
backrefs (e.g. (\1\1)) — JDK and Reggie disagree on these
semantically-pathological shapes and the disagreement is documented
as accepted divergence (Cat D in the triage doc).

Fuzz divergences drop from ~156 to ~135 on the default seed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed min

- Cat A: lazyFindMode runtime flag — find() returns min match, matches() extends greedily
- Cat E: zero-width greedy loop keeps counting until min reached
…ck, NFA zero-width + backref semantics

- SWARPatternAnalyzer: disable LiteralSetOptimization for 2-4 chars; it only
  searched literals[0], causing find() to miss positions for other chars
  (e.g. _*0|... scanning for '-' only, missing '0')
- FallbackPatternDetector: detect cross-alternative backrefs (\N in alt-i
  when group N is defined in alt-j≠i) and route to JDK for both
  OPTIMIZED_NFA_WITH_BACKREFS and RECURSIVE_DESCENT; Thompson NFA shared
  group state and RD backtracking both produce wrong results in this case
- NFABytecodeGenerator.generateFindMatchFromMethod: start matchEnd at
  matchStart (not matchStart+1) so zero-width matches are tried; use
  matchStart-1 as longestEnd sentinel; null-return check IF_ICMPGE not IF_ICMPNE
- RecursiveDescentBytecodeGenerator: unset group in backref returns -1
  (JDK: fail) instead of pos (PCRE: match empty)
- LinearPatternAnalyzer.visitLiteral: skip epsilon LiteralNode('\0') that
  the parser emits for empty group body (){n}
- NFA.contentHashCode: include state.backrefCheck so patterns differing
  only in referenced group number don't share an L2 structural-cache entry

Fuzz findings: 18 → 4 (2 unique repros), well below 10% ceiling.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…y fallback doc

- SubsetConstructor: set anchorConditionDiluted when differing-anchor contributors
  share a partition slice or accept site (transition guard / acceptance intersection
  collapses to unconditional); DFA carries the flag via new constructor param
- PatternAnalyzer: check dfa.isAnchorConditionDiluted() and alternation-priority
  conflict in both the plain-DFA and tagged-DFA paths; flag MatchingStrategyResult
  so RuntimeCompiler routes to JavaRegexFallbackMatcher
- DFAUnrolledBytecodeGenerator: remove the STRING_END early-return from
  generateStateCode(); matches() requires the full input to be consumed, so the
  "before-final-newline" path is invalid there — the end-of-input handler already
  accepts when pos == length
- StringAnchorsTest: correct assertions to match JDK semantics (abc\Z matches("abc\n")
  is false; the trailing \n is not consumed by \Z in matches() mode)
- AnchorRegressionTest: add four Cat-E/F anchor-dilution regression cases and a new
  testStringEndMatchesMode_doesNotConsumeTrailingNewline block that cross-checks
  Reggie against JDK for \Z in matches()
- FallbackPatternDetector: add hasNullableBackrefGroup() — OPTIMIZED_NFA_WITH_BACKREFS
  falls back when \N references a nullable group; shared group arrays record the greedy
  (non-empty) capture, causing the zero-length backref path to use the wrong span
- FallbackPatternDetector: document why lazy quantifiers remain in JDK fallback —
  RECURSIVE_DESCENT lacks general alternation backtracking; attempted removal exposed 36
  distinct failures all rooted in the same (a|ab)-style commitment problem
- NFAFallbackPatterns: relax xmlTags() to greedy .* with a comment explaining the
  original .*? falls back to java.util.regex and why
- ReggieMatcherBytecodeGeneratorTest: replace \d+? (now JDK fallback) with (\d+)\1{1,2}
  which routes to RECURSIVE_DESCENT via hasQuantifiedBackrefs
- Fuzz ceiling tightened from 25% to 10% now that Cat-E/F findings are resolved
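The `\Z` semantics referenced in the StringAnchorsTest and AnchorRegressionTest items can be checked against the JDK with a short sketch; the class below is illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StringEndAnchorDemo {
    /** Full-input match, as Pattern.matches() performs it. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** Returns the end offset of the first JDK find() hit, or -1 when nothing matches. */
    static int findEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }
}
```

In matches() mode `abc\Z` rejects `"abc\n"` because the trailing newline is never consumed, while find() succeeds with the match ending at offset 3, just before the newline.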

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ptionalGroupBackref

Six bug classes fixed, all confirmed by 0 findings in the 5 000-pattern fuzz sweep:

1. $ / \Z in extractQuantifierFromPattern: anchor-skip was too broad; END-type anchors
   before a char consumer make patterns unmatchable at non-end positions.
   extractQuantifierFromPattern now returns null for END/STRING_END, routing them to DFA/NFA
   instead of SPECIALIZED_QUANTIFIED_GROUP which was unaware of the constraint.
   FallbackPatternDetector adds a corresponding end-anchor-before-consumer rule for residual cases.

2. Negated CharClassNode ([^x]) in SPECIALIZED_QUANTIFIED_GROUP: detectQuantifiedCapturingGroup
   was discarding CharClassNode.negated, so isNegatedCharSet() always returned false
   for simple char-class groups. Generator then used the wrong negation direction, matching
   only the excluded char instead of everything else. Fix: propagate isNegatedCC to the full
   QuantifiedGroupInfo constructor and use info.isNegatedCharSet() in the generator.

3. ({0}) zero-max quantifier: GreedyCharClassBytecodeGenerator accepted max==0 and produced
   code that could never return a non-null match. detectGreedyCharClass now returns null for
   max==0; the pattern falls through to a strategy (SPECIALIZED_CONCAT_GREEDY_GROUP) that
   correctly emits an always-empty match.

4. SPECIALIZED_GREEDY_CHARCLASS findFrom for min=0: the char-scan loop skipped every
   position where the first char wasn't in the class, missing the valid empty match that
   min=0 (*) always yields at the scan start. generateFindFromMethod now returns start
   immediately when minMatches==0.

5. Lazy quantifiers in OPTIMIZED_NFA_WITH_BACKREFS: findMatchFromMethod returns the LONGEST
   match; lazy patterns need the SHORTEST. Extended the FallbackPatternDetector lazy rule to
   also cover OPTIMIZED_NFA_WITH_BACKREFS, so b+?|()(\1) and similar route to JDK.

6. (X)?\1 OptionalGroupBackref with non-participating group: generator treated the "group
   not matched" path as "backref satisfied" (vacuously matching empty), contrary to Java
   semantics where \N to a non-participating group FAILS. FallbackPatternDetector now routes
   OPTIONAL_GROUP_BACKREF patterns where the group content is non-nullable to JDK. Updated
   OptionalGroupBackrefTest to match verified JDK behaviour.

Additional: containsOptionalQuantifier added to the groups-path alternation-priority
conflict check, catching DFA over-greed in patterns like .([a]?[0-b]{3})+ where the
optional [a]? inside a repeating group creates implicit alternation. The outer quantifier
fixed-count / unbounded-inner guard narrowed so ([ab]+)+ (unbounded outer) is not affected.
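Two of the JDK behaviours relied on in the list above (items 4 and 6) can be confirmed with a short sketch; the class below is illustrative and uses only `java.util.regex`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackrefAndEmptyMatchDemo {
    /** Full-input match against the JDK. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** Returns {start, end} of the first JDK find() hit, or null when nothing matches. */
    static int[] firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }
}
```

In the JDK, `(a)?\1` rejects the empty string because a backreference to a non-participating group fails, and `[a-z]*` finds the empty match `[0,0)` in `123`.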

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lbacks

The containsOptionalQuantifier check introduced to fix .([a]?[0-b]{3})+ was
too broad: it also flagged (a*b*c*d*e*) and similar patterns where the DFA
result is correct, routing them to JavaRegexFallbackMatcher and cutting
benchmark throughput by ~200x for those patterns.

The divergence only occurs when an optional quantifier sits INSIDE a group
that is itself in a repeating quantifier (outer + or * or {n,m} with max>1).
Without the repeating outer loop, the DFA cannot accumulate extra chars via
ambiguous optional paths. Replace the broad walk with
hasOptionalInsideRepeatingGroup which only fires for the pattern:
  QuantifierNode(max>1, child=GroupNode(...optional inside...))

- (a*b*c*d*e*): group not in a repeating quantifier → NOT flagged (181k ops/ms restored)
- .([a]?[0-b]{3})+: [a]? inside (...)+ → still flagged → JDK fallback ✓

Fuzz: 0 findings on 5 000-pattern sweep.
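The narrowed check can be sketched over a toy AST. The node classes below are hypothetical stand-ins for the real QuantifierNode/GroupNode types, and "optional" is assumed here to mean any quantifier with min == 0.

```java
import java.util.Arrays;
import java.util.List;

final class OptionalInsideRepeatSketch {
    /** Base node; a bare Node with children acts as a concatenation. */
    static class Node {
        final List<Node> children;
        Node(Node... c) { children = Arrays.asList(c); }
    }
    static final class Atom extends Node { }
    static final class Group extends Node {
        Group(Node... c) { super(c); }
    }
    static final class Quantifier extends Node {
        final int min;
        final int max; // Integer.MAX_VALUE means unbounded
        Quantifier(int min, int max, Node child) { super(child); this.min = min; this.max = max; }
    }

    /** True when the subtree contains a quantifier that may match zero times. */
    static boolean containsOptional(Node n) {
        if (n instanceof Quantifier && ((Quantifier) n).min == 0) return true;
        for (Node c : n.children) if (containsOptional(c)) return true;
        return false;
    }

    /** Fires only for Quantifier(max > 1) directly wrapping a Group with an optional inside. */
    static boolean hasOptionalInsideRepeatingGroup(Node n) {
        if (n instanceof Quantifier) {
            Quantifier q = (Quantifier) n;
            Node child = q.children.get(0);
            if (q.max > 1 && child instanceof Group && containsOptional(child)) return true;
        }
        for (Node c : n.children) if (hasOptionalInsideRepeatingGroup(c)) return true;
        return false;
    }
}
```

Under these assumptions the tree for `(a*b*c*d*e*)` is not flagged, because the group is not itself repeated, while the tree for `.([a]?[0-b]{3})+` is.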

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4cb2184e8


@jbachorik jbachorik changed the title from "Adopt native token-sequence route for logs-backend Grok patterns" to "Add native token-sequence route for structured log patterns" May 29, 2026
@jbachorik jbachorik changed the title from "Add native token-sequence route for structured log patterns" to "Add native linear token-sequence regex execution path" May 29, 2026
@jbachorik jbachorik changed the title from "Add native linear token-sequence regex execution path" to "Improve runtime compatibility, capture extraction, and token-sequence execution" May 29, 2026
@jbachorik jbachorik requested a review from Copilot May 29, 2026 12:52

Copilot AI left a comment


Pull request overview

This PR significantly expands Reggie’s runtime and codegen capabilities to better handle large, capture-heavy (Grok-style) patterns with improved compatibility, lower-allocation capture extraction, and new deterministic “linear token sequence” execution routing.

Changes:

  • Adds runtime API enhancements: matchInto / findMatchInto, CapturePolicy + ReggieOptions, and a public UnsupportedPatternException for precise unsupported-pattern handling.
  • Improves parser compatibility and routing: atomic groups (?>...) (parsed as non-capturing), \Q...\E quoted literals (including inside char classes), plus expanded tests/resources for real Grok patterns.
  • Introduces scalability and correctness work across DFA/NFA/codegen: DFA state-budget handling + DFA_TABLE backend, structural-hash hardening, multiple anchor/quantifier/backref correctness fixes, and new fuzzing infrastructure.

Reviewed changes

Copilot reviewed 89 out of 89 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-1.regex | Adds Grok-expanded regex fixture for regression/routing coverage. |
| reggie-runtime/src/test/resources/com/datadoghq/reggie/runtime/logs-grok-pattern-2.regex | Adds a second Grok-expanded regex fixture for regression/routing coverage. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/UnsupportedPatternExceptionTest.java | Verifies unsupported constructs throw the new public exception type. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/StringAnchorsTest.java | Updates expectations around matches() + \Z / newline consumption. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/PCREParityDebugTest.java | Adjusts debug test semantics to use findMatch and clarifies intent. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/OptionalGroupBackrefTest.java | Aligns optional-group backref expectations to JDK semantics. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/MatchIntoAPITest.java | Adds tests for matchInto/findMatchInto behavior and overrides across strategies. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LookbehindVariantsTest.java | Adds regression coverage ensuring native DFA switch path handles IPv4 digit boundaries. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LogsBackendParserCompatibilityTest.java | Tests atomic group and \Q...\E parsing behavior relevant to logs backend patterns. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/LinearTokenSequenceMatcherTest.java | Adds tests for the new structural linear-token-sequence matcher and capture extraction. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/FallbackVerificationTest.java | Updates fallback expectations for fixed lookbehind+lookahead sandwich behavior. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/DFAStateBudgetFallbackTest.java | Adds tests validating DFA_TABLE routing and compilation stability for large patterns. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/CapturePolicyTest.java | Verifies CapturePolicy.NAMED_ONLY indexing and capture dropping semantics. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/BoundedQuantifierRegressionTest.java | Regression tests for bounded quantifier upper-bound and multi-range SWAR filtering fixes. |
| reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java | Adds diagnostic coverage for $-anchor cases (currently implemented as stdout-only). |
| reggie-runtime/src/main/java/com/datadoghq/reggie/UnsupportedPatternException.java | Introduces a public exception type for unsupported-but-valid regex constructs. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/ReggieMatcher.java | Adds caller-owned capture-boundary APIs (matchInto/findMatchInto) and supporting reusable state. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/JavaRegexFallbackMatcher.java | Implements matchInto/findMatchInto for the JDK fallback path. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/HybridMatcher.java | Overrides matchInto to combine DFA pre-check + NFA capture extraction. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/ReggieOptions.java | Adds runtime compilation options container (currently capture-policy focused). |
| reggie-runtime/src/main/java/com/datadoghq/reggie/Reggie.java | Adds compile/cached overloads accepting ReggieOptions. |
| reggie-runtime/src/main/java/com/datadoghq/reggie/CapturePolicy.java | Adds capture tracking policy enum including NAMED_ONLY. |
| reggie-processor/src/test/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGeneratorTest.java | Updates processor tests to reflect routing/semantics changes (recursive descent + optional group backrefs). |
| reggie-processor/src/test/java/com/datadoghq/reggie/processor/parsing/RegexParserTest.java | Adds parser tests for atomic groups and \Q...\E quoted literals. |
| reggie-processor/src/main/java/com/datadoghq/reggie/processor/ReggieMatcherBytecodeGenerator.java | Wires matchInto generation and DFA_TABLE generator; adds recursive-descent state init. |
| reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/DollarAnchorCacheDiagTest.java | Adds integration diagnostics around $ anchor/cache interactions and fuzz-sweep reproduction. |
| reggie-integration-tests/src/test/java/com/datadoghq/reggie/integration/AlgorithmicFuzzTest.java | Adds oracle-based fuzz sweep against JDK regex with shrink/dedupe + ceiling guard. |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzShrinker.java | Adds shrinker that reduces divergent cases to minimal repros. |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RegexFuzzOracle.java | Adds JDK-vs-Reggie comparison oracle for matches/findMatch behavior. |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomRegexGenerator.java | Adds grammar-driven deterministic regex generator for fuzzing. |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/RandomInputGenerator.java | Adds input generator aligned to fuzz regex alphabet (incl. newline). |
| reggie-integration-tests/src/main/java/com/datadoghq/reggie/integration/fuzz/FuzzRunner.java | Adds fuzz driver with configuration and finding caps. |
| reggie-integration-tests/build.gradle | Forwards -Dreggie.* JVM properties into test JVM for configurable fuzz size. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/StrategySelectionTest.java | Adds tests for DFA_TABLE selection and negated-class routing expectations. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/pbt/PatternRoutingPropertyBasedTest.java | Updates property-based routing expectations around large state-space behavior. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternRoutingPropertyTest.java | Updates DFA examples to reflect large-DFA fallback routing changes. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/PatternCategorizerTest.java | Adds tests for new structural categorizer and token kind extraction. |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlanTest.java | Adds tests for plan generation from categorization (including quoted capture folding). |
| reggie-codegen/src/test/java/com/datadoghq/reggie/codegen/analysis/FallbackPatternDetectorTest.java | Updates bug-5 regression expectation: no blanket fallback for lookbehind+lookahead. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/parsing/RegexParser.java | Adds atomic-group parse support and \Q...\E quoted literal parsing (incl. char classes). |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/SWARPatternAnalyzer.java | Gates multi-range SWAR optimization to known-correct shapes to avoid missed matches. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/StatelessLoopBytecodeGenerator.java | Fixes bounded-quantifier upper bound handling and {0} edge cases in generated loops. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/QuantifiedGroupBytecodeGenerator.java | Fixes negated charset flag propagation for quantified groups. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OptionalGroupBackrefBytecodeGenerator.java | Aligns optional-group backref handling with Java semantics (non-participating group fails). |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/OnePassBytecodeGenerator.java | Updates $ anchor checks to handle Java’s end/before-final-newline semantics. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/LinearPatternBytecodeGenerator.java | Mirrors $ anchor semantics update in linear bytecode path. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/GreedyCharClassBytecodeGenerator.java | Fixes min=0 empty-match behavior for greedy charclass find-from generation. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/BoundedQuantifierBytecodeGenerator.java | Fixes minimum repetition enforcement for bounded quantifiers when min > 1. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/ThompsonBuilder.java | Fixes counted-quantifier min==0 epsilon-bypass path in NFA construction. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/NFA.java | Hardens anchor reachability and structural hashing of anchor/assert/backref state features. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFATableData.java | Adds compact transition-table representation for large pure DFAs. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFA.java | Adds acceptance/entry anchor guards and an anchor-condition dilution signal. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/StructuralHash.java | Moves structural hash to 64-bit and hashes enum-driven data via ordinal/bitmask. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternCategorization.java | Adds categorization record for structural routing. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAtom.java | Adds semantic atom model used by categorizer and linear plan builder. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearTokenSequencePlan.java | Adds executable plan representation for token-sequence matcher execution. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/LinearPatternAnalyzer.java | Ensures epsilon literal nodes (char 0) are treated as non-consuming. |
| reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/CaptureProjection.java | Adds AST rewrite to preserve named/semantic captures while dropping unobservable unnamed captures. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAMatchIntoBenchmark.java | Adds benchmark for matchInto-based capture boundary extraction. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/NFAFallbackPatterns.java | Adjusts benchmark pattern to avoid lazy-quantifier fallback and still exercise recursive descent. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/LogsBackendGrokBenchmark.java | Adds Grok-like access-log benchmark for native token-sequence routing + matchInto. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/DFATableBenchmark.java | Adds performance benchmarks for large pure DFAs using DFA_TABLE backend. |
| reggie-benchmark/src/main/java/com/datadoghq/reggie/benchmark/AnchorPlacementBenchmark.java | Adds benchmarks targeting anchor-placement correctness fixes and overhead. |
| doc/plans/sub-2x-perf-candidates.md | Adds performance triage notes for candidate slow benchmarks. |
| doc/plans/logs-backend.md | Adds adoption requirements and describes new structural token-sequence route. |
| doc/plans/issue-priority.md | Adds issue prioritization plan for correctness/features. |
| doc/plans/fuzz-findings-triage.md | Adds fuzz divergence triage notes and fix directions. |
| doc/plans/fuzz-findings-triage-EF-residual.md | Documents residual anchor-condition dilution issue and possible remedies. |
| doc/plans/algorithmic-fuzz-tests-vs-jdk.md | Adds design doc describing the oracle-based fuzz approach. |
| doc/libretti/2026-05-11-datadogjava-reggie36-pcre-lookahead-combined-with-nested-alternation-produces-wro.md | Adds spec/trace doc for implemented PCRE parity fix. |
| doc/libretti/2026-05-10-datadogjava-reggie30-bug-only-first-alternative-in-lookbehind-alternation-is-chec.md | Adds spec doc for lookbehind-alternation bug. |
| doc/libretti/2026-05-10-datadogjava-reggie29-bug-unbounded-quantifier-after-lookbehind-always-fails-to-ma.md | Adds spec doc for lookbehind+unbounded-quantifier bug. |
| doc/libretti/2026-05-09-datadogjava-reggie35-pcre-inline-m-flag-inside-a-group-doesnt-activate-multiline.md | Adds spec doc for inline multiline-flag scoping bug. |
| doc/libretti/2026-05-08-datadogjava-reggie27-bug-multiple-backreferences-to-same-group-produce-false-posi.md | Adds spec doc for multiple-backref correctness bug. |
| AGENTS.md | Adds a hard rule about updating StructuralHash when DFA/NFA/PatternInfo structures change. |


Comment thread AGENTS.md
Comment thread reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java Outdated
Comment thread reggie-runtime/src/test/java/com/datadoghq/reggie/runtime/AnchorDiagTest.java Outdated
@jbachorik jbachorik merged commit 5f5706c into main May 29, 2026
8 checks passed
@jbachorik jbachorik deleted the jb/logs-backend branch May 29, 2026 13:18
@jbachorik jbachorik added this to the 0.3.0 milestone May 29, 2026