Java regular expression to match characters outside the Basic Multilingual Plane
2022-09-02 12:49:03
How do I match characters outside the Unicode Basic Multilingual Plane (BMP) in Java, with the intent of removing them?
To remove all non-BMP characters, the following should work:
String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", "");
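As a quick check of the one-liner above, here is a minimal, self-contained sketch. The class name `BmpRegexDemo` and the helper `stripNonBmp` are made up for illustration. It relies on the fact that Java's regex engine matches by code point, so a surrogate pair counts as a single character that falls outside the range U+0000 to U+FFFF.

```java
public class BmpRegexDemo {
    // Hypothetical helper: removes every code point above U+FFFF.
    // The negated class matches a whole surrogate pair as one code point.
    static String stripNonBmp(String input) {
        return input.replaceAll("[^\u0000-\uFFFF]", "");
    }

    public static void main(String[] args) {
        // U+10300 (OLD ITALIC LETTER A) written as a surrogate pair.
        String input = "test\uD800\uDF00test";
        String cleaned = stripNonBmp(input);
        System.out.println(input.length());   // 10 chars: the pair takes two
        System.out.println(cleaned);          // testtest
    }
}
```

Note that an unpaired surrogate is itself a value in the range U+D800 to U+DFFF, so this pattern leaves it in place.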
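Are you looking for specific characters, or for all characters outside the BMP?
If the former, you can use a StringBuilder to construct a string containing code points from the higher planes, and the regex will work as expected: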
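// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
Pattern regex = Pattern.compile(new StringBuilder().appendCodePoint(0x10300).toString());
Matcher matcher = regex.matcher(test);
// start() throws IllegalStateException unless find() succeeded, so check the result.
if (matcher.find())
{
    System.out.println(matcher.start());
}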
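If building the pattern with a StringBuilder feels awkward, one alternative (assuming Java 7 or later, where `Pattern` accepts the `\x{h...h}` escape) is to name the code point directly in the pattern. The sketch below uses a made-up class `SupplementaryMatchDemo` and helper `indexOfOldItalicA`; it is only an illustration, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SupplementaryMatchDemo {
    // Hypothetical helper: returns the UTF-16 index of the first U+10300
    // in the text, or -1 if the code point does not occur.
    static int indexOfOldItalicA(String text) {
        // \x{10300} names the code point directly (Java 7+ syntax).
        Matcher m = Pattern.compile("\\x{10300}").matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String test = new StringBuilder()
                .append("test").appendCodePoint(0x10300).append("test")
                .toString();
        System.out.println(indexOfOldItalicA(test));   // 4
        System.out.println(indexOfOldItalicA("test")); // -1
    }
}
```

The index is counted in `char` units, not code points, so text after the match is offset by two for each supplementary character.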
If you just want to remove all non-BMP characters from a string, I would use a StringBuilder directly rather than a regex:
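StringBuilder sb = new StringBuilder(test.length());
for (int ii = 0 ; ii < test.length() ; )
{
    int codePoint = test.codePointAt(ii);
    if (codePoint > 0xFFFF)
    {
        // Supplementary code point: skip both halves of its surrogate pair.
        ii += Character.charCount(codePoint);
    }
    else
    {
        // BMP code point: keep it and advance by one char.
        sb.appendCodePoint(codePoint);
        ii++;
    }
}
// The sanitized text is now available via sb.toString().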
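For comparison, the same filtering can be sketched with the code point stream that `String` offers from Java 8 onward. This is an alternative formulation, not the answer's own method; the class name `BmpFilterDemo` and the helper `stripNonBmp` are hypothetical.

```java
public class BmpFilterDemo {
    // Hypothetical helper: keeps only code points in the BMP (<= U+FFFF).
    static String stripNonBmp(String input) {
        return input.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String test = new StringBuilder()
                .append("test").appendCodePoint(0x10300).append("test")
                .toString();
        System.out.println(stripNonBmp(test)); // testtest
    }
}
```

The explicit loop avoids the stream overhead and works on older Java versions; the stream form is shorter if Java 8 or later is a given.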