大局观
我们将首先从整体大局算法中查看此正则表达式,然后稍后仔细查看具体的实现细节。正则表达式几乎是以下 Java 代码的直接翻译:
static boolean isPalindrome(String s) {
if (s.isEmpty()) {
return true;
}
String g2 = null;
for (char ch : s.toCharArray()) {
String g1 = String.valueOf(ch);
// "add"
if (g2 != null && s.endsWith(g1 + g2)) {
g2 = g1 + g2;
} else if (s.endsWith(g1)) {
g2 = g1;
} else {
break;
}
}
return s.equals(g2); // "chk"
}
这显然不是检查回文的最直接/最有效的Java代码,但它是有效的,最令人着迷的是,它几乎可以直接转换为正则表达式,并具有近乎一对一的映射。这里再次是正则表达式,为方便起见,在此处复制,并进行了注释以突出显示惊人的相似之处:
// isEmpty _for-loop_
// ↓ / \
"(?x) | (?:(.) add)+ chk"
// \_/ ↑
// g1 loop body ___g2___
// / \
.replace("add", assertEntirety(".*? (\\1 \\2?)"))
.replace("chk", assertEntirety("\\2"));
// s.equals(g2)
附件:ideone.com 上源代码的注释和扩展版本
(现在随意忽略的细节:只需将其视为黑盒正则表达式机制,无论我们当前的位置如何,都可以对整个字符串进行断言。assertEntirety
因此,基本算法是,当我们从左到右扫描字符串时,我们尝试构建一个后缀,受回文约束的约束。然后,我们检查是否能够以这种方式构建完整的字符串。如果可以的话,那么字符串就是一个回文。此外,作为一个特殊情况,空字符串是微不足道的回文。
一旦理解了大局算法,我们就可以检查正则表达式模式如何实现它。
所有的?String.replace
Java中的正则表达式模式最终只不过是字符串,这意味着它们可以通过字符串操作来派生,就像任何字符串一样。是的,我们甚至可以使用正则表达式来生成正则表达式模式 - 一种元正则表达式方法,如果你愿意的话。
考虑这个初始化常量的示例(它最终只包含一个数字):int
final int X = 604800;
final int Y = 60 * 60 * 24 * 7;
// now X == Y
分配给的数字是一个文字整数:我们可以清楚地看到数字是什么。使用表达式的情况并非如此,但是此公式似乎传达了此数字表示的含义。即使没有对这些常量进行适当的命名,我们仍然会得到可能表示一周内秒数的想法,即使我们可能不会立即知道数值是什么。另一方面,我们确切地知道这个数字,但我们对它代表什么的想法知之甚少。X
Y
Y
X
在代码段中使用字符串替换是一种类似的情况,但对于字符串正则表达式模式。有时,从更简单的部分对该值进行系统和逻辑派生(“公式”)而不是将模式显式编写为一个文本字符串,这样会更有意义。对于正则表达式尤其如此,在正则表达式中,我们了解模式的作用通常比能够看到它作为字符串文本的外观更重要(无论如何,这并不是一个外观,所有额外的反斜杠是什么)。
为方便起见,此处再次复制了部分代码段:
// the "formula"
final String PALINDROME =
"(?x) | (?:(.) add)+ chk"
.replace("add", assertEntirety(".*? (\\1 \\2?)"))
.replace("chk", assertEntirety("\\2"));
// the "value"
System.out.println(PALINDROME);
// ____add_____ chk_
// _______/ \____ _______/ \_____
// (?x) | (?:(.) (?<=(?=^.*? (\1 \2?)$).*))+ (?<=(?=^\2$).*)
// | \_/ \______/ |
// | 1 2 |
// |_______________________________|
毫无疑问,在这种情况下,“公式”比最终的字符串“value”更具可读性。
当然,有更复杂的方法来以编程方式生成正则表达式模式,并且当然可以以混淆而不是强调其含义的方式编写,但是即使使用简单的字符串替换,也有意识的使用仍然会令人惊奇(希望在本例中所示)。
课程:考虑以编程方式生成正则表达式模式。
如何运作?add
该结构是执行某种“计数”的断言,在前两部分中已经进行了彻底的讨论。不过,有两个功能值得注意:(?:(.) add)+
add
- 捕获到组 1 中,允许稍后进行反向引用
(.)
- 断言不仅仅是从我们目前的立场向前看
assertEntirety
- 我们稍后将更详细地讨论这一点;只需将其视为对整个字符串进行断言的一种方式
应用于 中的模式如下:assertEntirety
add
# prefix _suffix_
# ↓ / \
.*? ( \1 \2? )
# \________/ i.e. a reluctant "whatever" prefix (as short as possible)
# group 2 followed by a suffix captured into group 2
请注意,组 2 是使用可选说明符进行自引用,本系列的第 2 部分中已经讨论了该技术。毋庸置疑,组 2 是我们在这种模式中的“计数器”:它是一个后缀,我们将在“循环”的每次迭代中尝试向左扩展。当我们从左到右迭代每个字符时,我们尝试在后缀中附加相同的字符(使用反向引用)。(.)
\1
再次回想一下上述模式的 Java 代码转换,为方便起见,请在此处转载:
if (g2 != null && s.endsWith(g1 + g2)) { // \2? is greedy, we try this first
g2 = g1 + g2;
} else if (s.endsWith(g1)) { // since \2? is optional, we may also try this
g2 = g1;
} else { // if there's no matching suffix, we "break" out of the "loop"
break;
}
可选的事实意味着以下几点:\2?
- 它为自我参考提供了一个“基本案例”(我们这样做的主要原因!
- 由于 是后缀模式的一部分(因此在整体模式的后面出现),因此前缀部分必须不情愿,因此不是 。这允许行使其贪婪。
\2?
.*?
.*
\2?
- “计数器”也可能“重置”并给出“错误”的结果
- 在第 2 部分中,我们展示了回溯如何导致相同类型的有问题的重置
?
- 我们通过使用所有格量词解决了这个问题,但这在这里不适用
?+
第三点将在下一节中进一步阐述。
教训:仔细分析模式中贪婪/不情愿的重复之间的相互作用。
相关问题
Why do we need a phase?chk
As alluded in the previous section, the optional and backtrackable means that our suffix can shrink under some circumstances. We will examine such a scenario step by step for this input:\2?
x y x y z y x
↑
# Initial state, \2 is "uninitialized"
_
(x)y x y z y x
↑
# \1 captured x, \2 couldn't match \1\2 (since \2 is "uninitialized")
# but it could match \1 so it captured x
___
x(y)x y z y x
↑
# \1 captured y, \2 matched \1\2 and grew to capture yx
_
x y(x)y z y x
↑
# \1 captured x, \2 couldn't match \1\2,
# but it could match \1 so it shrunk to capture x (!!!)
___
x y x(y)z y x
↑
# \1 captured y, \2 matched \1\2 and grew to capture yx
_____
x y x y(z)y x
↑
# \1 captured z, \2 matched \1\2 and grew to capture zyx
_______
x y x y z(y)x
↑
# \1 captured y, \2 matched \1\2 and grew to capture yzyx
_________
x y x y z y(x)
↑
# \1 captured x, \2 matched \1\2 and grew to capture xyzyx
We can modify our pattern (and the corresponding Java code) to omit the phase, and see that indeed this is what happens:chk
// modified pattern without a chk phase; yields false positives!
final String PALINDROME_BROKEN =
"(?x) | (?:(.) add)+"
.replace("add", assertEntirety(".*? (\\1 \\2?)"));
String s = "xyxyzyx"; // NOT a palindrome!!!
Matcher m = Pattern.compile(PALINDROME_BROKEN).matcher(s);
if (m.matches()) {
System.out.println(m.group(2)); // prints "xyzyx"
}
As explained, , which is NOT a palindrome, is falsely reported as one, because we didn't check if the growing suffix eventually became the complete string (which it clearly did not in this case). The phase (which is an of the pattern ) is therefore an absolute necessity in our setup. We need to confirm that indeed we managed to grow our suffix all the way. If this is the case, then we have ourselves a palindrome."xyxyzyx"
chk
assertEntirety
\2
Lesson: Carefully analyze any possibly unintended side-effects of optional self-reference matching.
The Main Course: assertEntirety
While it's neat that we can write a Java regex pattern to detect palindromes, everything here except has already been covered in the previous parts of the series. The only new thing here is this mysterious black box, this powerful mechanism that magically allowed us to do what is otherwise "impossible".assertEntirety
The construct is based on the following meta-pattern of nested lookarounds:assertEntirety
(?<=(?=^pattern$).*)
" I can see a place somewhere behind me where I can look ahead and see ^pattern$
"
The name "lookaround" implies the relativity to our current position: we're looking around us, perhaps in front of or behind, from where we're standing. By nesting a lookahead in a lookbehind in this manner, we're able to metaphorically "fly into the skies" and look at the entire picture.
Abstracting this meta-pattern into is a bit like writing preprocessing substitution macros. Having nested lookarounds everywhere probably hurts readability and maintainability, so we encapsulate it into , which not only hides the complexity of its inner workings, but also further emphasizes its semantics by giving it an appropriate name.assertEntirety
assertEntirety
Lesson: Consider abstracting meta-patterns to hide complexity and convey semantics.
Appendix: On infinite-length lookbehind in Java
Observant readers would notice that contains a in a lookbehind, which makes its theoretical maximum length infinite. No, Java does not officially support infinite-length lookbehind. Yes, as it has been adequatedly demonstrated here, it works anyway. Officially it's categorized as a "bug"; but "someone"(*wink*) could also consider it a "hidden feature".assertEntirety
.*
It's certainly possible that this "bug" will be "fixed" in the future. Removal of this hidden feature will break this particular solution to the Java regex palindrome problem.
相关问题