preg_match and UTF-8 in PHP

php unicode utf-8 pcre

2022-08-30 11:26:30

I'm trying to search a UTF-8 encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

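This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems it is not treating the subject as a UTF-8 encoded string, even though I am passing the "u" modifier in the regular expression.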

I have the following settings in my php.ini, and other UTF-8 functions are working:

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

Answer 1

Although the u modifier makes both the pattern and the subject be interpreted as UTF-8, the captured offsets are still counted in bytes.
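To see why the reported offset is 2, it can help to look at the raw bytes. The following is a small illustrative sketch rather than part of the original answer; the variable names are made up, and the encoding is passed explicitly to mb_strlen so the result does not depend on the mbstring settings:

```php
<?php
$str = "\xC2\xA1Hola!";                  // "¡Hola!" encoded as UTF-8

// "¡" (U+00A1) occupies two bytes, C2 A1, so "H" (0x48) starts at byte 2.
$hex = bin2hex($str);

$byteLength = mb_strlen($str, '8bit');   // every byte counted separately
$charLength = mb_strlen($str, 'UTF-8');  // "¡" counted as one character

echo $hex, "\n", $byteLength, " bytes, ", $charLength, " characters\n";
```

The string is one byte longer than its character count, and that extra byte sits before the "H", which is exactly the difference between the expected 1 and the reported 2.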

You can use mb_strlen to get the length in UTF-8 characters rather than bytes:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
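The same conversion can be wrapped in a helper. This is a sketch under stated assumptions: the function name utf8_char_offset is invented for illustration, and it slices with mb_substr in the '8bit' encoding instead of substr, because with mbstring.func_overload = 7 (as in the question) substr is itself replaced by its multibyte counterpart and would no longer cut at a byte position.

```php
<?php
// Hypothetical helper (not from the original answer): converts a byte
// offset reported by PREG_OFFSET_CAPTURE into a UTF-8 character offset.
function utf8_char_offset(string $subject, int $byteOffset): int
{
    // '8bit' makes mb_substr treat every byte as one unit, so this takes
    // exactly the first $byteOffset bytes of the subject.
    $prefix = mb_substr($subject, 0, $byteOffset, '8bit');

    // Count the UTF-8 characters that precede the match.
    return mb_strlen($prefix, 'UTF-8');
}

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);

// Character position of "H" in "¡Hola!".
$charOffset = utf8_char_offset($str, $a_matches[0][1]);
echo $charOffset;
```

Passing the encoding explicitly to both calls keeps the helper independent of mbstring.internal_encoding.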

Answer 2

Try adding (*UTF8) at the start of the regular expression, just inside the opening delimiter:

preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Magic, thanks to a comment at https://www.php.net/manual/function.preg-match.php#95828