Summary
Lazy/non-greedy quantifiers (*?, +?, ??, {n,m}?) inside capturing groups do not produce the correct minimal-match captures. The Thompson NFA does not natively support backtracking required for lazy semantics when groups are involved.
Failing PCRE Tests (Category 2)
(|ab)*?d on abd — expected group 1 = ab, got null
^[ab]{1,3}?(ab*?|b) on aabbbbb — expected group 1 = a, got abbbbb
^[ab]{1,3}?(ab*|b) on aabbbbb — expected group 1 = abbbbb, got b
(?i)(a+|b){0,1}? on AB — expected group 1 = '', got 'A'
(([a-c])b*?\2){3} should match ababbbcbc but didn't
Expected gain: +5 PCRE conformance tests
Root Cause
Thompson NFA processes states in parallel without backtracking. Lazy quantifiers in capturing groups need the engine to prefer the shorter match and backtrack to longer ones only when necessary. A new LAZY_QUANTIFIER_BACKTRACK strategy is likely required.
Implementation Notes
- Tracked in
doc/plans/pcre-conformance-roadmap.md as Phase 3, item 3.1
- Difficulty: High
- Proposed approach: New
LAZY_QUANTIFIER_BACKTRACK bytecode generation strategy
- Files:
PatternAnalyzer.java (detection), new LazyQuantifierBytecodeGenerator.java
Summary
Lazy/non-greedy quantifiers (
*?,+?,??,{n,m}?) inside capturing groups do not produce the correct minimal-match captures. The Thompson NFA does not natively support backtracking required for lazy semantics when groups are involved.Failing PCRE Tests (Category 2)
(|ab)*?donabd— expected group 1 =ab, gotnull^[ab]{1,3}?(ab*?|b)onaabbbbb— expected group 1 =a, gotabbbbb^[ab]{1,3}?(ab*|b)onaabbbbb— expected group 1 =abbbbb, gotb(?i)(a+|b){0,1}?onAB— expected group 1 ='', got'A'(([a-c])b*?\2){3}should matchababbbcbcbut didn'tExpected gain: +5 PCRE conformance tests
Root Cause
Thompson NFA processes states in parallel without backtracking. Lazy quantifiers in capturing groups need the engine to prefer the shorter match and backtrack to longer ones only when necessary. A new
LAZY_QUANTIFIER_BACKTRACKstrategy is likely required.Implementation Notes
doc/plans/pcre-conformance-roadmap.mdas Phase 3, item 3.1LAZY_QUANTIFIER_BACKTRACKbytecode generation strategyPatternAnalyzer.java(detection), newLazyQuantifierBytecodeGenerator.java