Detect encoding and make everything UTF-8

php encoding utf-8 character-encoding

2022-08-30 06:01:58

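I'm reading lots of texts from various RSS feeds and inserting them into my database.

Of course, several different character encodings are used in the feeds, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example: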

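1. The "ß" in "Fußball" should look like this in my database: "ÂŸ". If it is a "ÂŸ", it is displayed correctly.
2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÂŸ". Then it is displayed wrongly, of course.
3. In other cases, the "ß" is saved as a "ß", so without any change. Then it is also displayed wrongly.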

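What can I do to avoid cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

1. How do I find out what encoding the text uses?
2. How do I convert it to UTF-8, whatever the old encoding is?

Would a function like this work?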

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it, but it doesn't work. What's wrong with it?

Answer 1

If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.
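To make the double-encoding effect concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), which performs that same conversion but is deprecated as of PHP 8.2. The variable names are only for illustration.

```php
<?php
// "Fußball" as UTF-8 bytes: ß is the two-byte sequence C3 9F.
$utf8 = "Fu\xC3\x9Fball";

// Treating those bytes as ISO-8859-1 and converting to UTF-8 again
// encodes each of the two bytes separately (C3 -> C3 83, 9F -> C2 9F).
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // original bytes
echo bin2hex($double), "\n"; // double-encoded bytes, four bytes where ß used to be
```

The second string is still valid UTF-8, which is why this kind of damage is easy to miss: nothing fails, the text just displays as two odd characters instead of one.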

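I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a data feed that was all messed up, mixing UTF-8 and Latin1 in the same string.

Usage: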

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
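For comparison only, and not as a description of how the library works internally: when each string is known to be either entirely UTF-8 or entirely in one single-byte encoding, a much simpler check may be enough. The sketch below assumes mbstring is available and assumes Windows-1252 as the fallback; the function name to_utf8_if_needed() is hypothetical. It does not handle strings that mix encodings, which is the case Encoding::toUTF8() is meant for.

```php
<?php
/**
 * Hypothetical helper: return $text unchanged if it is already valid UTF-8,
 * otherwise assume it is in $fallback and convert it.
 */
function to_utf8_if_needed(string $text, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (this includes plain ASCII)
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// "Fußball" in Latin1/Windows-1252 (ß is the single byte DF) and in UTF-8.
echo bin2hex(to_utf8_if_needed("Fu\xDFball")), "\n";
echo bin2hex(to_utf8_if_needed("Fu\xC3\x9Fball")), "\n";
```

Both calls should yield the same UTF-8 bytes. The weakness of this approach is that a short single-byte string can occasionally happen to be valid UTF-8, in which case it is passed through unconverted.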

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
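The idea behind this kind of repair can be sketched as follows. This is a simplified heuristic under stated assumptions, not the library's actual algorithm: it assumes mbstring is available, assumes the damage came from UTF-8 bytes being misread as Windows-1252 and re-encoded, and only handles cleanly double-encoded input (not strings where bytes were lost along the way). The function name undo_double_encoding() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: peel off layers of "UTF-8 read as Windows-1252" encoding.
 * A layer is removed only if the step is lossless and the result is valid UTF-8.
 */
function undo_double_encoding(string $text, int $maxPasses = 5): string
{
    for ($i = 0; $i < $maxPasses; $i++) {
        // Map each character back to the single byte it was misread from.
        $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

        $unchanged = ($candidate === $text); // pure ASCII, nothing to undo
        $lossless  = (mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $text);
        $validUtf8 = mb_check_encoding($candidate, 'UTF-8');

        if ($unchanged || !$lossless || !$validUtf8) {
            break; // the current text is as far back as we can safely go
        }
        $text = $candidate;
    }
    return $text;
}

// "Fédération" whose UTF-8 bytes were encoded to UTF-8 a second time.
$garbled = "F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration";
echo undo_double_encoding($garbled), "\n";
```

Because it is a heuristic, a string that legitimately contains a sequence such as "Ã©" would also be rewritten, so it is best applied only to data already known to be damaged.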

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

Answer 2

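You first have to detect what encoding has been used. As you're parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that's missing too, use UTF-8 as defined in the specification.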
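That lookup order can be sketched as a small pure function, separate from any HTTP handling. The function name detect_feed_encoding() and its parameters are hypothetical, and the regular expressions are deliberately simple: they assume a well-formed Content-Type value and an XML declaration at the very start of the body (no byte order mark in front of it).

```php
<?php
/**
 * Hypothetical helper: pick the encoding for a feed.
 * Order: Content-Type charset parameter, then XML declaration, then UTF-8.
 */
function detect_feed_encoding(?string $contentType, string $body): string
{
    // 1. charset parameter of the Content-Type header value, if any
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    // 2. encoding attribute of the XML declaration, if any
    if (preg_match('/^<\?xml[^>]*\sencoding\s*=\s*["\']([^"\']+)["\']/', $body, $m)) {
        return strtolower($m[1]);
    }
    // 3. default
    return 'utf-8';
}

echo detect_feed_encoding('text/xml; charset=ISO-8859-1', '<?xml version="1.0"?><rss/>'), "\n";
echo detect_feed_encoding('text/xml', '<?xml version="1.0" encoding="Windows-1252"?><rss/>'), "\n";
echo detect_feed_encoding(null, '<rss/>'), "\n";
```

Keeping the decision in one function that takes plain strings makes it easy to test with canned inputs before wiring it to a real HTTP client.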

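Here is what I'd probably do:

I'd use cURL to send the request and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field with the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we'll check the XML PI for the encoding attribute and get the encoding from there. If that's also missing, the XML spec defines UTF-8 as the encoding to use.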

$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

$accept = array(
    'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
    'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
    'Accept: '.implode(', ', $accept['type']),
    'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
    // error fetching the response
} else {
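    $offset = strpos($response, "\r\n\r\n");
    $header = substr($response, 0, $offset);
    // Extract the body up front; it is needed no matter where the encoding is found.
    $body = substr($response, $offset + 4);
    // Stop both groups at the end of the header line so they cannot run into the next header.
    if (!$header || !preg_match('/^Content-Type:\s+([^;\r\n]+)(?:;\s*charset=([^;\r\n]*))?/im', $header, $match)) {
        // error parsing the response
    } else {
        if (!in_array(strtolower(trim($match[1])), array_map('strtolower', $accept['type']))) {
            // type not accepted
        }
        // The charset parameter is optional, so the second group may be missing.
        $encoding = isset($match[2]) ? strtolower(trim($match[2], "\"' \t")) : null;
    }
    if (!$encoding) {
        // No charset in the header: fall back to the XML declaration.
        if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
            $encoding = strtolower(trim($match[1], '"\''));
        }
    }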
    if (!$encoding) {
        $encoding = 'utf-8';
    } else {
        if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
            // encoding not accepted
        }
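        if ($encoding != 'utf-8') {
            $body = mb_convert_encoding($body, 'utf-8', $encoding);
            // Keep the XML declaration in sync with the converted bytes so the parser does not decode them a second time.
            $body = preg_replace('/^(<\?xml[^>]*?encoding=)("[^"]*"|\'[^\']*\')/', '$1"utf-8"', $body, 1);
        }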
    }
    $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
    if (!$simpleXML) {
        // parse error
    } else {
        echo $simpleXML->asXML();
    }
}