7/18/2014

Regexp Lookahead & Lookbehind


Source: http://denis-zhdanov.blogspot.com/2009/10/regexp-lookahead-and-lookbehind.html
Source: http://www.regular-expressions.info/lookaround.html

Lookahead


Generally speaking, 'lookahead' lets you check whether the subsequent input characters match a particular regexp. For example, let's take the word 'murmur' as our input and capitalize every 'r' that is not followed by an 'm' (i.e. we expect to get 'murmuR' as the output). Here is a naive approach:

    System.out.println("murmur".replaceAll("r[^m]", "R"));


However, it doesn't change anything, i.e. 'murmur' is printed. The reason is that 'r[^m]' means an 'r' followed by a character other than 'm', and a character class always has to consume one character. There is no character after the last 'r', so the pattern does not match.
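To make the failure above visible, here is a small sketch that inspects what 'r[^m]' actually matches. The class name ConsumingClassDemo, the helper firstMatch and the extra input 'murmurs' are illustrative assumptions, not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumingClassDemo {

    // Hypothetical helper: returns the first substring matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // No character follows the last 'r', so [^m] has nothing to consume.
        System.out.println(firstMatch("r[^m]", "murmur"));       // null
        // With a trailing 's' the pattern matches, but the match is two characters long.
        System.out.println(firstMatch("r[^m]", "murmurs"));      // rs
        // Replacing the whole match therefore drops the 's' as well.
        System.out.println("murmurs".replaceAll("r[^m]", "R"));  // murmuR
    }
}
```

So even when the naive pattern does match, it takes the following character along with the 'r', which is rarely what is wanted.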

Here lookahead comes to the rescue: it lets you check whether the subsequent character(s) match the provided regexp without making them part of the match. It may be 'positive' or 'negative', i.e. it can require either that the following text matches or that it does not match. Here is the example:

// 'Negative' lookahead, 'r' is not followed by 'm'. >> prints 'murmuR'
System.out.println("murmur".replaceAll("r(?!m)", "R")); 

// 'Positive' lookahead, 'r' is followed by 'm'. >> prints 'muRmur'
System.out.println("murmur".replaceAll("r(?=m)", "R")); 

// A full regexp may be used as the 'lookahead' pattern: 'm' is not followed by 'u', one or more non-'u' characters and another 'u' >> prints 'murMur'
System.out.println("murmur".replaceAll("m(?!u[^u]+u)", "M"));
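The point that lookahead does not make the inspected characters part of the match can be checked by printing match boundaries. The following is a minimal sketch; the class ZeroWidthLookaheadDemo, the helper matchSpans and the input 'murmurs' are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthLookaheadDemo {

    // Hypothetical helper: collects "start-end:text" for every match of regex in input.
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Only the 'r' at index 5 is matched; the 's' inspected by the lookahead is not consumed.
        System.out.println(matchSpans("r(?!m)", "murmurs"));      // [5-6:r]
        System.out.println("murmurs".replaceAll("r(?!m)", "R"));  // murmuRs
        // At the end of the input nothing follows, so the negative lookahead succeeds.
        System.out.println(matchSpans("r(?!m)", "murmur"));       // [5-6:r]
    }
}
```

Because the match is always exactly one character wide, the replacement touches only the 'r' and leaves whatever follows it intact.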



Lookbehind


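'Lookbehind' behaves very similarly to 'lookahead' but works backwards: it checks the characters that precede the current position. Another difference is that only a regexp with finite repetition (e.g. '?', '{2}' or '{1,3}') may be used as the 'lookbehind' pattern, so that its maximum length is obvious to the engine: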

// 'Negative' lookbehind, 'c' is not preceded by 'b' >> prints 'abcadCaeeC' 
System.out.println("abcadcaeec".replaceAll("(?<!b)c", "C"));

// 'Positive' lookbehind, 'c' is preceded by 'b' >> prints 'abCadcaeec'
System.out.println("abcadcaeec".replaceAll("(?<=b)c", "C"));

// Finite repetition is allowed in the 'lookbehind' pattern >> prints 'abcadcaec' unchanged because no 'c' is preceded by two 'e'
System.out.println("abcadcaec".replaceAll("(?<=e{2})c", "C")); 

// A regexp without an obvious maximum length cannot be used as the 'lookbehind' pattern >> PatternSyntaxException is thrown here (at least on the JDK versions current when this was written)
System.out.println("abcadcaec".replaceAll("(?<=a[^a]+ae?)c", "C"));
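One possible workaround for the last case is to replace the unbounded '+' with a bounded quantifier, which gives the lookbehind an obvious maximum length. The sketch below assumes that a run of at most 10 non-'a' characters is enough; the names BoundedLookbehindDemo, MAX_RUN, BOUNDED and compiles are hypothetical. Whether the unbounded form is rejected may depend on the JDK version, so no expected output is given for that line.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BoundedLookbehindDemo {

    // Assumed cap on the run of non-'a' characters; pick a value that fits the real input.
    static final int MAX_RUN = 10;

    // Same lookbehind as in the last example, with '+' replaced by a bounded quantifier.
    static final String BOUNDED = "(?<=a[^a]{1," + MAX_RUN + "}ae?)c";

    // Hypothetical helper: reports whether this JDK accepts the regex at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The result depends on the JDK in use, so it is printed without an expected value.
        System.out.println(compiles("(?<=a[^a]+ae?)c"));
        System.out.println(compiles(BOUNDED));                      // true
        // Only the last 'c' is preceded by 'a', non-'a' characters, 'a' and an optional 'e'.
        System.out.println("abcadcaec".replaceAll(BOUNDED, "C"));   // abcadcaeC
    }
}
```

The bound trades generality for a pattern the engine can size up front: a run longer than MAX_RUN would no longer match.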
