Remove ✅ ,
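I have some strings that contain all kinds of different emojis/images/signs.
Not all of the strings are in English - some of them are in other, non-Latin languages, for example: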
▓ railway??
→ Cats and dogs
I'm on
Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter,"");
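For a version that can be run on its own, here is a minimal sketch that wraps the same filter in a small class. The class name EmojiFilter, the method name strip, and the sample inputs are made up for illustration; only the regular expression comes from the two lines above. The pattern is compiled once, which is an optional tweak and not something the original requires.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name and method are assumptions for illustration.
final class EmojiFilter {

    // Same whitelist as above, negated: matches any character that is NOT a
    // letter, mark, number, punctuation, separator, format, surrogate, or whitespace.
    private static final Pattern NOT_WHITELISTED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Returns the input with every non-whitelisted character removed.
    static String strip(String input) {
        return NOT_WHITELISTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \u2705 is the check mark emoji; \uD83D\uDD25 is a surrogate pair (U+1F525).
        System.out.println(strip("Cats \u2705 and dogs"));
        System.out.println(strip("fire \uD83D\uDD25!"));
        // Non-Latin letters fall under \p{L} and are kept.
        System.out.println(strip("\u7686\u3055\u3093"));
    }
}
```

One consequence of whitelisting by category: symbol characters (\\p{S}) are not on the list, so ASCII symbols such as + or $ are removed along with the emojis. Add \\p{Sm} or \\p{Sc} to the class if you need to keep them.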
So:
[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]
is a character class covering all numeric (\\p{N}), letter (\\p{L}), mark (\\p{M}), punctuation (\\p{P}), whitespace/separator (\\p{Z}), other formatting (\\p{Cf}), and surrogate (\\p{Cs}, included to cover characters above U+FFFF in Unicode) characters, plus whitespace such as newlines and tabs (\\s). \\p{L} specifically includes the letters of other scripts such as Cyrillic, Latin, Kanji, etc. The leading ^ in the character class negates the match, so replaceAll removes every character outside these categories. Example:
String str = "hello world _# 皆さん、こんにちは! 私はジョンと申します。