这个Java正则表达式如何检测回文？

regex java palindrome nested-reference lookaround

2022-09-01 23:41:48

^{这是一系列教育正则表达式文章中的第三部分。接下来是这个正则表达式如何找到三角形数字的？（嵌套引用首次引入的地方）和我们如何将a^n b^n与Java正则表达式相匹配？（进一步阐述了前瞻“计数”机制）。这部分介绍了一种特定形式的嵌套断言，当与嵌套引用结合使用时，Java正则表达式可以与大多数人认为“不可能”的东西相匹配：回文！}

回文的语言是非常规的;它实际上是上下文无关的（对于给定的字母表）。也就是说，现代正则表达式实现识别的不仅仅是常规语言，以及Perl / PCRE的递归模式和.NET的平衡组可以很容易地识别回文（参见：相关问题）。

但是，Java的正则表达式引擎既不支持这些“高级”功能。然而，“某人”^（*wink*）设法编写了以下正则表达式，这似乎做得很好（另请参阅 ideone.com）：

public class Palindrome {
    // asserts that the entirety of the string matches the given pattern
    static String assertEntirety(String pattern) {
        return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
    }

    public static void main(String[] args) {
        final String PALINDROME =
            "(?x) | (?:(.) add)+ chk"
                .replace("add", assertEntirety(".*? (\\1 \\2?)"))
                .replace("chk", assertEntirety("\\2"));

        System.out.println(PALINDROME);
        // (?x) | (?:(.) (?<=(?=^.*? (\1 \2?)$).*))+ (?<=(?=^\2$).*)
        
        String[] tests = {
            "",     // true
            "x",    // true
            "xx",   // true
            "xy",   // false
            "xyx",  // true
            "xxx",  // true
            "xxyx", // false
            "racecar",                // true
            "step on no pets",        // true
            "aManaPlanaCanalPanaMa",  // true
            "this is impossible",     // FALSE!!!
        };
        for (String test : tests) {
            System.out.printf("[%s] %s%n", test, test.matches(PALINDROME));
        }
    }
}

所以这似乎有效，但如何工作？

引用

常识警报!!!

这不是检测回文的最佳方法;充其量只是。使用更通用的编程语言执行此检测既更有效又更直接。O(N^3)

您不希望使用正则表达式来检测回文，原因与您不希望使用正则表达式查找质数的原因相同。也就是说，您将研究非递归非平衡组正则表达式如何检测回文，原因与您研究正则表达式如何用于素数测试的原因相同：它很有趣，具有挑战性，具有教育意义。

大局观

我们将首先从整体大局算法中查看此正则表达式，然后稍后仔细查看具体的实现细节。正则表达式几乎是以下 Java 代码的直接翻译：

static boolean isPalindrome(String s) {
   if (s.isEmpty()) {
      return true;
   }
   String g2 = null;
   for (char ch : s.toCharArray()) {
      String g1 = String.valueOf(ch);
      // "add"
      if (g2 != null && s.endsWith(g1 + g2)) {
         g2 = g1 + g2;
      } else if (s.endsWith(g1)) {
         g2 = g1;
      } else {
         break;
      }
   }
   return s.equals(g2); // "chk"
}

这显然不是检查回文的最直接/最有效的Java代码，但它是有效的，最令人着迷的是，它几乎可以直接转换为正则表达式，并具有近乎一对一的映射。这里再次是正则表达式，为方便起见，在此处复制，并进行了注释以突出显示惊人的相似之处：

//  isEmpty  _for-loop_
//       ↓  /          \
    "(?x) | (?:(.) add)+ chk"
//             \_/  ↑
//             g1   loop body                   ___g2___
//                                             /        \
           .replace("add", assertEntirety(".*? (\\1 \\2?)"))
           .replace("chk", assertEntirety("\\2"));
                           // s.equals(g2)

附件：ideone.com 上源代码的注释和扩展版本

（现在随意忽略的细节：只需将其视为黑盒正则表达式机制，无论我们当前的位置如何，都可以对整个字符串进行断言。assertEntirety

因此，基本算法是，当我们从左到右扫描字符串时，我们尝试构建一个后缀，受回文约束的约束。然后，我们检查是否能够以这种方式构建完整的字符串。如果可以的话，那么字符串就是一个回文。此外，作为一个特殊情况，空字符串是微不足道的回文。

一旦理解了大局算法，我们就可以检查正则表达式模式如何实现它。

所有的？`String.replace`

Java中的正则表达式模式最终只不过是字符串，这意味着它们可以通过字符串操作来派生，就像任何字符串一样。是的，我们甚至可以使用正则表达式来生成正则表达式模式 - 一种元正则表达式方法，如果你愿意的话。

考虑这个初始化常量的示例（它最终只包含一个数字）：int

final int X = 604800;
final int Y = 60 * 60 * 24 * 7;
// now X == Y

分配给的数字是一个文字整数：我们可以清楚地看到数字是什么。使用表达式的情况并非如此，但是此公式似乎传达了此数字表示的含义。即使没有对这些常量进行适当的命名，我们仍然会得到可能表示一周内秒数的想法，即使我们可能不会立即知道数值是什么。另一方面，我们确切地知道这个数字，但我们对它代表什么的想法知之甚少。XYYX

在代码段中使用字符串替换是一种类似的情况，但对于字符串正则表达式模式。有时，从更简单的部分对该值进行系统和逻辑派生（“公式”）而不是将模式显式编写为一个文本字符串，这样会更有意义。对于正则表达式尤其如此，在正则表达式中，我们了解模式的作用通常比能够看到它作为字符串文本的外观更重要（无论如何，这并不是一个外观，所有额外的反斜杠是什么）。

为方便起见，此处再次复制了部分代码段：

// the "formula"
     final String PALINDROME =
        "(?x) | (?:(.) add)+ chk"
           .replace("add", assertEntirety(".*? (\\1 \\2?)"))
           .replace("chk", assertEntirety("\\2"));

// the "value"
     System.out.println(PALINDROME);
     //                       ____add_____             chk_
     //               _______/            \____   _______/ \_____
     // (?x) | (?:(.) (?<=(?=^.*? (\1 \2?)$).*))+ (?<=(?=^\2$).*)
     //        |  \_/             \______/     |
     //        |   1                 2         |
     //        |_______________________________|

毫无疑问，在这种情况下，“公式”比最终的字符串“value”更具可读性。

当然，有更复杂的方法来以编程方式生成正则表达式模式，并且当然可以以混淆而不是强调其含义的方式编写，但是即使使用简单的字符串替换，也有意识的使用仍然会令人惊奇（希望在本例中所示）。

课程：考虑以编程方式生成正则表达式模式。

如何运作？`add`

该结构是执行某种“计数”的断言，在前两部分中已经进行了彻底的讨论。不过，有两个功能值得注意：(?:(.) add)+add

捕获到组 1 中，允许稍后进行反向引用(.)
断言不仅仅是从我们目前的立场向前看assertEntirety
- 我们稍后将更详细地讨论这一点;只需将其视为对整个字符串进行断言的一种方式

应用于中的模式如下：assertEntiretyadd

# prefix   _suffix_
#    ↓    /        \
    .*?   ( \1 \2? )
#         \________/   i.e. a reluctant "whatever" prefix (as short as possible)
#          group 2          followed by a suffix captured into group 2

请注意，组 2 是使用可选说明符进行自引用，本系列的第 2 部分中已经讨论了该技术。毋庸置疑，组 2 是我们在这种模式中的“计数器”：它是一个后缀，我们将在“循环”的每次迭代中尝试向左扩展。当我们从左到右迭代每个字符时，我们尝试在后缀中附加相同的字符（使用反向引用）。(.)\1

再次回想一下上述模式的 Java 代码转换，为方便起见，请在此处转载：

if (g2 != null && s.endsWith(g1 + g2)) {   // \2? is greedy, we try this first
   g2 = g1 + g2;
} else if (s.endsWith(g1)) {    // since \2? is optional, we may also try this
   g2 = g1;
} else {        // if there's no matching suffix, we "break" out of the "loop"
   break;
}

可选的事实意味着以下几点：\2?

它为自我参考提供了一个“基本案例”（我们这样做的主要原因！
由于是后缀模式的一部分（因此在整体模式的后面出现），因此前缀部分必须不情愿，因此不是。这允许行使其贪婪。\2?.*?.*\2?
“计数器”也可能“重置”并给出“错误”的结果
- 在第 2 部分中，我们展示了回溯如何导致相同类型的有问题的重置?
  - 我们通过使用所有格量词解决了这个问题，但这在这里不适用?+

第三点将在下一节中进一步阐述。

教训：仔细分析模式中贪婪/不情愿的重复之间的相互作用。

Why do we need a phase?`chk`

As alluded in the previous section, the optional and backtrackable means that our suffix can shrink under some circumstances. We will examine such a scenario step by step for this input:\2?

 x y x y z y x
↑
# Initial state, \2 is "uninitialized"
             _
(x)y x y z y x
  ↑
  # \1 captured x, \2 couldn't match \1\2 (since \2 is "uninitialized")
  #                but it could match \1 so it captured x
           ___
 x(y)x y z y x
    ↑
    # \1 captured y, \2 matched \1\2 and grew to capture yx
             _
 x y(x)y z y x
      ↑
      # \1 captured x, \2 couldn't match \1\2,
      #                but it could match \1 so it shrunk to capture x (!!!)
           ___
 x y x(y)z y x
        ↑
        # \1 captured y, \2 matched \1\2 and grew to capture yx
         _____
 x y x y(z)y x
          ↑
          # \1 captured z, \2 matched \1\2 and grew to capture zyx
       _______
 x y x y z(y)x
            ↑
            # \1 captured y, \2 matched \1\2 and grew to capture yzyx
     _________
 x y x y z y(x)
              ↑
              # \1 captured x, \2 matched \1\2 and grew to capture xyzyx

We can modify our pattern (and the corresponding Java code) to omit the phase, and see that indeed this is what happens:chk

    // modified pattern without a chk phase; yields false positives!
    final String PALINDROME_BROKEN =
        "(?x) | (?:(.) add)+"
            .replace("add", assertEntirety(".*? (\\1 \\2?)"));

    String s = "xyxyzyx"; // NOT a palindrome!!!
    
    Matcher m = Pattern.compile(PALINDROME_BROKEN).matcher(s);
    if (m.matches()) {
        System.out.println(m.group(2)); // prints "xyzyx"
    }

As explained, , which is NOT a palindrome, is falsely reported as one, because we didn't check if the growing suffix eventually became the complete string (which it clearly did not in this case). The phase (which is an of the pattern ) is therefore an absolute necessity in our setup. We need to confirm that indeed we managed to grow our suffix all the way. If this is the case, then we have ourselves a palindrome."xyxyzyx"chkassertEntirety\2

Lesson: Carefully analyze any possibly unintended side-effects of optional self-reference matching.

The Main Course: `assertEntirety`

While it's neat that we can write a Java regex pattern to detect palindromes, everything here except has already been covered in the previous parts of the series. The only new thing here is this mysterious black box, this powerful mechanism that magically allowed us to do what is otherwise "impossible".assertEntirety

The construct is based on the following meta-pattern of nested lookarounds:assertEntirety

(?<=(?=^pattern$).*)

" I can see a place somewhere behind me where I can look ahead and see ^pattern$ "

The name "lookaround" implies the relativity to our current position: we're looking around us, perhaps in front of or behind, from where we're standing. By nesting a lookahead in a lookbehind in this manner, we're able to metaphorically "fly into the skies" and look at the entire picture.

Abstracting this meta-pattern into is a bit like writing preprocessing substitution macros. Having nested lookarounds everywhere probably hurts readability and maintainability, so we encapsulate it into , which not only hides the complexity of its inner workings, but also further emphasizes its semantics by giving it an appropriate name.assertEntiretyassertEntirety

Lesson: Consider abstracting meta-patterns to hide complexity and convey semantics.

Appendix: On infinite-length lookbehind in Java

Observant readers would notice that contains a in a lookbehind, which makes its theoretical maximum length infinite. No, Java does not officially support infinite-length lookbehind. Yes, as it has been adequatedly demonstrated here, it works anyway. Officially it's categorized as a "bug"; but "someone"^(*wink*) could also consider it a "hidden feature".assertEntirety.*

It's certainly possible that this "bug" will be "fixed" in the future. Removal of this hidden feature will break this particular solution to the Java regex palindrome problem.

这个Java正则表达式如何检测回文？

引用

常识警报!!!

相关问题

大局观

所有的？String.replace

如何运作？add

相关问题

Why do we need a phase?chk

The Main Course: assertEntirety

Appendix: On infinite-length lookbehind in Java

相关问题

所有的？`String.replace`

如何运作？`add`

Why do we need a phase?`chk`

The Main Course: `assertEntirety`