[pcre] Non-greedy (lazy) quantifiers inside capturing groups produce wrong captures #37

@jbachorik

Summary

Lazy (non-greedy) quantifiers (*?, +?, ??, {n,m}?) inside capturing groups do not produce the correct minimal-match captures. The Thompson NFA does not natively support the backtracking that lazy semantics require when groups are involved.

Failing PCRE Tests (Category 2)

  • (|ab)*?d on abd: expected group 1 = ab, got null
  • ^[ab]{1,3}?(ab*?|b) on aabbbbb: expected group 1 = a, got abbbbb
  • ^[ab]{1,3}?(ab*|b) on aabbbbb: expected group 1 = abbbbb, got b
  • (?i)(a+|b){0,1}? on AB: expected group 1 = '', got 'A'
  • (([a-c])b*?\2){3} on ababbbcbc: expected a match, got no match

Expected gain: +5 PCRE conformance tests
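
To cross-check the expectations above, the patterns can be run against a backtracking engine. The sketch below uses java.util.regex as a stand-in reference. That is an assumption: the issue does not say which engine produced the expected values, and java.util.regex is not PCRE. The class and method names are illustrative only. The (?i)(a+|b){0,1}? case is left out because how an unset or empty group is reported depends on the test harness.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class LazyCaptureReference {

    /** Group 1 of the first match, or null if there is no match or the group did not participate. */
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    /** The whole first match (group 0), or null if the pattern does not match anywhere. */
    static String wholeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Each line prints what the reference engine reports for one failing case.
        System.out.println(group1("(|ab)*?d", "abd"));
        System.out.println(group1("^[ab]{1,3}?(ab*?|b)", "aabbbbb"));
        System.out.println(group1("^[ab]{1,3}?(ab*|b)", "aabbbbb"));
        System.out.println(wholeMatch("(([a-c])b*?\\2){3}", "ababbbcbc"));
    }
}
```

Comparing these results with the captures produced by the NFA engine gives a quick regression check while the new strategy is being developed.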

Root Cause

The Thompson NFA simulation advances all active states in parallel and never backtracks. Lazy quantifiers in capturing groups need the engine to prefer the shorter match and to extend to a longer one only when the rest of the pattern fails to match. A new LAZY_QUANTIFIER_BACKTRACK strategy is likely required.
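
The "prefer the shorter match, extend only when necessary" rule can be illustrated with a small continuation-style matcher. This is a generic sketch of backtracking lazy semantics, not the proposed LAZY_QUANTIFIER_BACKTRACK strategy itself; every name in it is hypothetical. The continuation returns the overall match end, or -1 on failure.

```java
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

public final class LazyBacktrackSketch {

    /** atom*? : try the rest of the pattern first, consume one more atom only if that fails. */
    static int lazyStar(String s, int pos, IntPredicate atom, IntUnaryOperator rest) {
        int end = rest.applyAsInt(pos);
        if (end >= 0) return end;
        if (pos < s.length() && atom.test(s.charAt(pos))) {
            return lazyStar(s, pos + 1, atom, rest);
        }
        return -1;
    }

    /** atom* : consume as many atoms as possible, give them back one at a time on failure. */
    static int greedyStar(String s, int pos, IntPredicate atom, IntUnaryOperator rest) {
        if (pos < s.length() && atom.test(s.charAt(pos))) {
            int end = greedyStar(s, pos + 1, atom, rest);
            if (end >= 0) return end;
        }
        return rest.applyAsInt(pos);
    }

    /** Models the capture of (ab*?) or (ab*) anchored at index 0 with nothing after the group. */
    static String captureAThenBs(String s, boolean lazyMode) {
        if (s.isEmpty() || s.charAt(0) != 'a') return null;
        IntPredicate isB = c -> c == 'b';
        IntUnaryOperator endOfPattern = pos -> pos; // nothing left to match, always succeeds
        int end = lazyMode
                ? lazyStar(s, 1, isB, endOfPattern)
                : greedyStar(s, 1, isB, endOfPattern);
        return end < 0 ? null : s.substring(0, end);
    }

    /** Models b*?c anchored at index 0: the lazy loop must extend until 'c' can match. */
    static int lazyBsThenC(String s) {
        IntUnaryOperator literalC = pos -> (pos < s.length() && s.charAt(pos) == 'c') ? pos + 1 : -1;
        return lazyStar(s, 0, c -> c == 'b', literalC);
    }
}
```

With input abbbbb, captureAThenBs yields a in lazy mode and abbbbb in greedy mode, which mirrors the difference between the ab*? and ab* alternatives in the failing tests.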

Implementation Notes

  • Tracked in doc/plans/pcre-conformance-roadmap.md as Phase 3, item 3.1
  • Difficulty: High
  • Proposed approach: New LAZY_QUANTIFIER_BACKTRACK bytecode generation strategy
  • Files: PatternAnalyzer.java (detection), new LazyQuantifierBytecodeGenerator.java
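
The detection step might look roughly like the sketch below. The real PatternAnalyzer API is not shown in this issue, so the class name, method name, and the text-scanning approach are all assumptions; a real implementation would more likely inspect the parsed AST. The scan flags a pattern when it contains both a capturing group and a lazy quantifier. Known simplifications: \Q...\E quoting, a literal ] placed first in a character class, nested classes, and PCRE-specific group syntax such as (?P<name>...) are not handled.

```java
public final class LazyQuantifierDetectorSketch {

    /** True when the pattern text contains both a capturing group and a lazy quantifier. */
    static boolean needsLazyStrategy(String p) {
        boolean sawLazy = false;
        boolean sawCapture = false;
        boolean inClass = false;
        boolean afterQuantifier = false; // previous token was *, +, ? or {n,m}

        for (int i = 0; i < p.length(); i++) {
            char c = p.charAt(i);
            boolean wasQuantifier = afterQuantifier;
            afterQuantifier = false;

            if (c == '\\') {
                i++; // skip the escaped character
            } else if (inClass) {
                if (c == ']') inClass = false;
            } else if (c == '[') {
                inClass = true;
            } else if (c == '(') {
                if (p.startsWith("(?", i)) {
                    // (?<name>...) captures; (?<=...) and (?<!...) are lookbehinds.
                    boolean named = p.startsWith("(?<", i) && i + 3 < p.length()
                            && p.charAt(i + 3) != '=' && p.charAt(i + 3) != '!';
                    sawCapture |= named;
                    i++; // skip the '?' so it is not read as a quantifier
                } else {
                    sawCapture = true;
                }
            } else if (c == '*' || c == '+' || c == '?') {
                if (wasQuantifier) {
                    // Second character is a modifier: '?' means lazy, '+' means possessive.
                    sawLazy |= (c == '?');
                } else {
                    afterQuantifier = true;
                }
            } else if (c == '{') {
                int close = p.indexOf('}', i);
                // Only {n}, {n,} and {n,m} count as quantifiers; anything else is a literal brace.
                if (close > i && p.substring(i + 1, close).matches("\\d+(,\\d*)?")) {
                    i = close;
                    afterQuantifier = true;
                }
            }
        }
        return sawLazy && sawCapture;
    }
}
```

All five failing patterns listed in this issue are flagged by this scan, while (?:ab)*?d (no capturing group) and (ab)*d (no lazy quantifier) are not.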

Labels: bug, enhancement
