Undocumented Java regex character class: \p{C}

regex unicode java

2022-09-01 04:39:31

I found an interesting regex in a Java project: "[\\p{C}&&\\S]"

I understand that && means "set intersection" and that \S means "non-whitespace", but what is \p{C}, and is it okay to use?

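The java.util.regex.Pattern documentation doesn't mention it. The only similar class on the list is \p{Cntrl}, but they behave differently: they both match control characters, but \p{C} matches twice on Unicode characters above U+FFFF, such as PILE OF POO: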

public class StrangePattern {
    public static void main(String[] argv) {

        // As far as I can tell, this is the simplest way to create a String
        // with code points above U+FFFF.
        String poo = new String(Character.toChars(0x1F4A9));

        System.out.println(poo);  // prints the PILE OF POO character (U+1F4A9)
    }
}

Answer 1

Buried down in the Pattern docs under Unicode Support, we find the following:

This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents.

...

Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Same as scripts and blocks, categories can also be specified by using the keyword general_category (or its short form gc) as in general_category=Lu or gc=Lu.

The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative.
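To make the quoted syntax rules concrete, here is a minimal sketch that tries the spellings the docs describe on the category C. The class name CategorySpellings and its helpers are made up for illustration; the only assumption is that C is accepted in the same forms the docs show for L and Lu.

```java
import java.util.regex.Pattern;

public class CategorySpellings {
    // Spellings of the same general category, following the forms in the quoted docs.
    static final String[] SPELLINGS = {
        "\\p{C}", "\\p{IsC}", "\\p{gc=C}", "\\p{general_category=C}"
    };

    // True if every spelling matches the whole string s.
    static boolean everySpellingMatches(String s) {
        for (String regex : SPELLINGS) {
            if (!Pattern.matches(regex, s)) return false;
        }
        return true;
    }

    // True if no spelling matches the whole string s.
    static boolean noSpellingMatches(String s) {
        for (String regex : SPELLINGS) {
            if (Pattern.matches(regex, s)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String nul = "\u0000"; // U+0000 is a control character (category Cc)
        for (String regex : SPELLINGS) {
            System.out.println(regex
                + "  matches U+0000: " + Pattern.matches(regex, nul)
                + "  matches \"a\": " + Pattern.matches(regex, "a"));
        }
    }
}
```

If the four lines agree with each other, the prefix and keyword forms are just alternative spellings of the same category.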

From Unicode Technical Standard #18, we find that C is defined to match any Other General_Category value, and that support for this is part of the requirements for Level 1 conformance. Java implements \p{C} because it claims conformance to Level 1 of UTS #18.

It probably should support \p{Other}, but apparently it doesn't.
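A rough way to see the difference from \p{Cntrl}, and to check the \p{Other} claim on your own JDK, is a sketch like the following. The class OtherCategoryDemo and the sample code points are my own picks, not anything from the project in the question; \p{Cntrl} is used without the UNICODE_CHARACTER_CLASS flag, where the docs define it as [\x00-\x1F\x7F].

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class OtherCategoryDemo {
    // True if the regex compiles on the running JDK.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Whole-string match of a single code point against a regex.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0007, // BEL, category Cc, inside ASCII
            0x0085, // NEXT LINE, category Cc, outside ASCII
            0x200B, // ZERO WIDTH SPACE, category Cf
            0xE000, // first private-use code point, category Co
            0x0041  // LATIN CAPITAL LETTER A, category Lu
        };
        for (int cp : samples) {
            System.out.printf("U+%04X  \\p{C}=%b  \\p{Cntrl}=%b%n",
                cp, matches("\\p{C}", cp), matches("\\p{Cntrl}", cp));
        }
        // Reports what this JDK does with the long category name.
        System.out.println("\\p{Other} compiles: " + compiles("\\p{Other}"));
    }
}
```

The point of the sample set is that \p{Cntrl} only covers the ASCII controls, while \p{C} also picks up format and private-use characters.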

Worse, it's violating RL1.7, required for Level 1 conformance, which requires that matching happen by code point instead of code unit:

To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.

There should be no matches for \p{C} in your test string, because your test string should be matched as a single emoji code point with General_Category=So (Other Symbol) instead of as two surrogates.
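The code point view described above can be checked with a small sketch. It assumes a JDK whose Character class follows Unicode 6.0 or later, where U+1F4A9 is an assigned character; CodePointView is a hypothetical class name. It deliberately uses only whole-string matches(), so it says nothing about how find() or replaceAll scan through the string.

```java
public class CodePointView {
    // The test string from the question: U+1F4A9 as a Java String.
    static String poo() {
        return new String(Character.toChars(0x1F4A9));
    }

    // True if the code point at index 0 has General_Category=So.
    static boolean isOtherSymbol(String s) {
        return Character.getType(s.codePointAt(0)) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        String poo = poo();
        // One code point, stored as two UTF-16 code units (a surrogate pair).
        System.out.println("UTF-16 code units: " + poo.length());
        System.out.println("code points:       " + poo.codePointCount(0, poo.length()));
        System.out.println("category is So:    " + isOtherSymbol(poo));

        // Whole-string matching treats the surrogate pair as one code point.
        System.out.println("matches \\p{So}:    " + poo.matches("\\p{So}"));
        System.out.println("matches \\p{C}:     " + poo.matches("\\p{C}"));

        // A surrogate taken on its own is category Cs, which is part of C.
        String loneHigh = poo.substring(0, 1);
        System.out.println("lone surrogate matches \\p{C}: " + loneHigh.matches("\\p{C}"));
    }
}
```

Seeing the pair classified as So while a lone half is classified under C is exactly the code point versus code unit distinction that RL1.7 is about.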

Answer 2

According to https://regex101.com/, \p{C} matches

Invisible control characters and unused code points

(The \ has to be escaped inside a Java string literal, so the string "\\p{C}" is the regex \p{C}.)

I'm guessing this is a "hacked string" check, as a \p{C} character probably should never appear inside a valid (character-filled) string. Still, the author should have left a comment, because what they checked and what they wanted to check are usually two different things.
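For what it's worth, here is a sketch of what the full expression from the question does when used as a filter. Whether the project uses it this way is only a guess; InvisibleCharFilter and strip are hypothetical names.

```java
public class InvisibleCharFilter {
    // The regex from the question: category C intersected with non-whitespace.
    static final String INVISIBLE = "[\\p{C}&&\\S]";

    // Removes every character matched by INVISIBLE (hypothetical helper).
    static String strip(String s) {
        return s.replaceAll(INVISIBLE, "");
    }

    public static void main(String[] args) {
        // The Java literal has doubled backslashes; the regex engine sees single ones.
        System.out.println(INVISIBLE); // prints [\p{C}&&\S]

        // U+0007 (BEL) and U+200B (ZERO WIDTH SPACE) are in C and are not whitespace;
        // the trailing newline is also in C, but \S excludes it.
        String input = "a" + "\u0007" + "b" + "\u200B" + "c\n";
        String cleaned = strip(input);
        System.out.println(cleaned.equals("abc\n")); // prints true
    }
}
```

So the && \S part is what keeps ordinary line breaks and tabs from being treated as suspicious.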