You need something like this (assuming UTF-8 input, and ignoring CJK (Chinese, Japanese, Korean)):
$chr_map = array(
// Windows codepage 1252
"\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
"\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
"\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
"\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
"\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
"\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
"\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
"\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark
// Regular Unicode // U+0022 quotation mark (")
// U+0027 apostrophe (')
"\xC2\xAB" => '"', // U+00AB left-pointing double angle quotation mark
"\xC2\xBB" => '"', // U+00BB right-pointing double angle quotation mark
"\xE2\x80\x98" => "'", // U+2018 left single quotation mark
"\xE2\x80\x99" => "'", // U+2019 right single quotation mark
"\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
"\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
"\xE2\x80\x9C" => '"', // U+201C left double quotation mark
"\xE2\x80\x9D" => '"', // U+201D right double quotation mark
"\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
"\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
"\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
"\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$chr = array_keys ($chr_map); // but: for efficiency you should
$rpl = array_values($chr_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));
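Here is the background:
Every Unicode character belongs to exactly one "General Category". The categories that can contain quotation marks are the following:
Ps (Punctuation, Open), Pe (Punctuation, Close), Pi (Punctuation, Initial quote), Pf (Punctuation, Final quote), Po (Punctuation, Other)
(the per-category character pages are handy for checking that you did not miss anything - there is also an index of categories)
Sometimes it is useful to match these categories in a Unicode-enabled regex.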
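As a rough illustration of matching by General Category, the sketch below uses PCRE's \p{..} escapes with the /u modifier. The helper name find_quote_candidates is hypothetical, and only the two quote-specific categories Pi and Pf are matched, because Ps, Pe and Po also contain brackets and ordinary punctuation:

```php
<?php
// Hypothetical helper: returns every character of $str whose General
// Category is Pi (initial quote) or Pf (final quote).
function find_quote_candidates(string $str): array
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    if (preg_match_all('/[\p{Pi}\p{Pf}]/u', $str, $matches) === false) {
        return array(); // e.g. the subject was not valid UTF-8
    }
    return $matches[0];
}

// "\u{...}" escapes need PHP 7.0 or later.
$found = find_quote_candidates("\u{201C}Hi\u{201D} \u{00AB}x\u{00BB}");
print_r($found); // the four quote characters, in input order
```

Quotes that live in the other categories, such as U+0022 and U+0027 (both Po) or U+201A (Ps), are not found this way, which is one reason an explicit map is used in this answer.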
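Furthermore, Unicode characters have "properties", of which the one you are interested in is Quotation_Mark. Unfortunately, these are not accessible in a regex.
On Wikipedia you can find the group of characters with the Quotation_Mark property. The definitive reference is PropList.txt on unicode.org, but this is an ASCII text file.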
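If you need to translate CJK characters too, you only have to get their code points, decide on their translations, and find their UTF-8 encoding, e.g. by looking it up on fileformat.info (e.g. for U+301E: http://www.fileformat.info/info/unicode/char/301e/index.htm).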
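Instead of looking the bytes up, you could also let PHP print them. This is only a sketch: utf8_escape is a hypothetical helper name, and the "\u{...}" string escape requires PHP 7.0 or later.

```php
<?php
// Hypothetical helper: returns the UTF-8 bytes of $char written as a PHP
// double-quoted escape sequence, ready to paste into $chr_map as a key.
function utf8_escape(string $char): string
{
    $out = '';
    foreach (str_split($char) as $byte) {   // str_split() works byte by byte
        $out .= '\\x' . strtoupper(bin2hex($byte));
    }
    return $out;
}

echo utf8_escape("\u{301E}"), "\n"; // prints \xE3\x80\x9E
```

The printed sequence can then be added to the map with whatever replacement you decided on, for example "\xE3\x80\x9E" => '"'.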
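About Windows codepage 1252: Unicode defines its first 256 code points to represent exactly the same characters as ISO-8859-1. However, ISO-8859-1 is often confused with Windows codepage 1252, so all browsers render the range 0x80-0x9F, which is "empty" in ISO-8859-1 (more exactly: it contains control characters), as if it were Windows codepage 1252. The table on the Wikipedia page lists the Unicode equivalents.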
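The difference between the two interpretations can be seen on a single byte. The sketch below assumes the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumption: the mbstring extension is available.
$byte = "\x93"; // left double quotation mark in Windows codepage 1252

// Read as ISO-8859-1, the byte becomes the C1 control character U+0093.
$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
// Read as Windows codepage 1252, it becomes U+201C.
$as_cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($as_latin1), "\n"; // c293
echo bin2hex($as_cp1252), "\n"; // e2809c
```

The first result, "\xC2\x93", is exactly the kind of key that the "Windows codepage 1252" part of $chr_map catches.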
Note: strtr() is often slower than str_replace(). Time it with your input and your PHP version. If it is fast enough, you can pass a map like my $chr_map to it directly.
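A minimal sketch of the strtr() variant, using a shortened copy of the map so that the example stays self-contained:

```php
<?php
// Shortened copy of $chr_map, for illustration only.
$chr_map = array(
    "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
    "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
    "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
    "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
);

$str = "\xE2\x80\x9CIt\xE2\x80\x99s fine\xE2\x80\x9D";
$str = strtr($str, $chr_map); // the map is passed as is, no array_keys()/array_values()
echo $str, "\n"; // prints "It's fine"
```

With the two-argument form, strtr() takes the from/to pairs from the array itself, so the two helper arrays are not needed.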
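If you are not sure whether your input is UTF-8 encoded, AND you are willing to assume that if it is not, it is ISO-8859-1 or Windows codepage 1252, then you can do this before anything else: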
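// preg_match() rejects the subject when it is not well-formed UTF-8.
if ( !preg_match('/^\\X*$/u', $str)) {
// utf8_encode() converts ISO-8859-1 to UTF-8; it is deprecated as of PHP 8.2.
$str = utf8_encode($str);
}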
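CAVEAT: This regex can in very rare cases fail to detect a non-UTF-8 encoding. For example, "Gruß…"/*CP-1252*/=="Gru\xDF\x85" looks like UTF-8 to this regex (U+07C5 is the N'ko digit 5). The regex can be slightly enhanced, but unfortunately it can be shown that there is NO completely foolproof solution to the problem of encoding detection.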
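The caveat can be reproduced with two short byte strings. In this sketch, looks_like_utf8 is a hypothetical wrapper around the same regex:

```php
<?php
// Hypothetical wrapper: true when PCRE accepts $str as well-formed UTF-8.
function looks_like_utf8(string $str): bool
{
    // preg_match() returns false (not 0) when the subject is malformed UTF-8.
    return preg_match('/^\X*$/u', $str) === 1;
}

$latin1_text = "Gr\xFC\xDF";  // "Grüß" in ISO-8859-1 / codepage 1252
$cp1252_text = "Gru\xDF\x85"; // "Gruß…" in codepage 1252

var_dump(looks_like_utf8($latin1_text)); // bool(false): correctly detected
var_dump(looks_like_utf8($cp1252_text)); // bool(true): the false negative
```

The second string passes because the bytes DF 85 happen to form the valid two-byte UTF-8 sequence for U+07C5.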
If you want to normalize the range 0x80-0x9F that stems from Windows codepage 1252 to regular Unicode code points, you can do this (and remove the first part of the above $chr_map):
$normalization_map = array(
"\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
"\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
"\xC2\x83" => "\xC6\x92", // U+0192 latin small letter f with hook
"\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
"\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
"\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
"\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
"\xC2\x88" => "\xCB\x86", // U+02C6 modifier letter circumflex accent
"\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
"\xC2\x8A" => "\xC5\xA0", // U+0160 latin capital letter s with caron
"\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
"\xC2\x8C" => "\xC5\x92", // U+0152 latin capital ligature oe
"\xC2\x8E" => "\xC5\xBD", // U+017D latin capital letter z with caron
"\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
"\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
"\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
"\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
"\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
"\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
"\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
"\xC2\x98" => "\xCB\x9C", // U+02DC small tilde
"\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
"\xC2\x9A" => "\xC5\xA1", // U+0161 latin small letter s with caron
"\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
"\xC2\x9C" => "\xC5\x93", // U+0153 latin small ligature oe
"\xC2\x9E" => "\xC5\xBE", // U+017E latin small letter z with caron
"\xC2\x9F" => "\xC5\xB8", // U+0178 latin capital letter y with diaeresis
);
$chr = array_keys ($normalization_map); // but: for efficiency you should
$rpl = array_values($normalization_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, $str);