How do I remove bad characters that don't fit MySQL's utf8 encoding?

unicode java mysql utf-8

2022-09-03 01:52:13

I have dirty data. Sometimes it contains characters like this. I use this data to build queries such as

WHERE a.address IN ('mydatahere')

For this character I get

org.hibernate.exception.GenericJDBCException: Illegal mix of collations (utf8_bin,IMPLICIT), (utf8mb4_general_ci,COERCIBLE), (utf8mb4_general_ci,COERCIBLE) for operation ' IN '

How can I filter out characters like this? I use Java.

Thanks.

Answer 1

When I had this kind of problem, I used a Perl script to make sure the data was converted to valid UTF-8, with code like this:

use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
    print Encode::decode('UTF-8', $_);
}

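This script reads (possibly corrupted) UTF-8 from stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with � (U+FFFD, the Unicode replacement character).

If you run this script on good UTF-8 input, the output should be identical to the input.

If the data lives in a database, it makes sense to use DBI to scan the table(s) and scrub all the data with this approach, so that everything is guaranteed to be valid UTF-8.

This is a Perl one-liner version of the same script: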

perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt

EDIT: Added a Java-only solution.

This is an example of how to do this in Java:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class UtfFix {
    public static void main(String[] args) throws InterruptedException, CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {
            (byte) 0xD0, (byte) 0x9F, // 'П'
            (byte) 0xD1, (byte) 0x80, // 'р'
            (byte) 0xD0,              // corrupted UTF-8, was 'и'
            (byte) 0xD0, (byte) 0xB2, // 'в'
            (byte) 0xD0, (byte) 0xB5, // 'е'
            (byte) 0xD1, (byte) 0x82  // 'т'
        });
        CharBuffer parsed = decoder.decode(bb);
        System.out.println(parsed);
        // this prints: Пр�вет (U+FFFD shows up as '?' if the console cannot render it)
    }
}
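
The decoder setup above can be wrapped into a small reusable helper. This is only a sketch: the class name `Utf8Sanitizer` and both method names are made up for illustration and are not part of the original answer. `sanitize` uses `CodingErrorAction.REPLACE` as in the example above, while `isValidUtf8` uses `CodingErrorAction.REPORT` so that a malformed sequence raises an exception instead of being replaced, which may help to find the rows that actually need cleaning.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative, not from the answer.
final class Utf8Sanitizer {

    // Decodes bytes as UTF-8, replacing each malformed sequence with U+FFFD.
    static String sanitize(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected in REPLACE mode, but decode() declares the checked exception.
            throw new IllegalStateException(e);
        }
    }

    // Returns true only if the bytes are well-formed UTF-8 (strict REPORT mode).
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Привет" written with escapes so the source file encoding does not matter.
        byte[] good = "\u041F\u0440\u0438\u0432\u0435\u0442".getBytes(StandardCharsets.UTF_8);
        // 'П' followed by a truncated two-byte sequence.
        byte[] bad = { (byte) 0xD0, (byte) 0x9F, (byte) 0xD0 };

        System.out.println("good is valid: " + isValidUtf8(good));
        System.out.println("bad is valid:  " + isValidUtf8(bad));
        System.out.println("good unchanged after sanitize: "
                + sanitize(good).equals("\u041F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

Valid input passes through `sanitize` unchanged, matching the behaviour described for the Perl script, so it should be safe to run over data that is already clean.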

Answer 2

You can encode the string to UTF-8 and then decode it back from UTF-8:

String label = "look into my eyes 〠.〠";

Charset charset = Charset.forName("UTF-8");
label = charset.decode(charset.encode(label)).toString();

System.out.println(label);

Output:

look into my eyes ?.?

EDIT: I think this might only work on Java 6.
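
Since the round trip above may behave differently across Java versions, here is a sketch that filters by code point instead. It rests on an assumption the question does not confirm: that the offending characters are ones above U+FFFF, which need 4 bytes in UTF-8 and cannot be stored in MySQL's 3-byte `utf8` character set. The `utf8mb4` collations in the error message are consistent with that, but it is a guess. The class name `BmpFilter` and the method `stripSupplementary` are invented for this example, and `String.codePoints()` requires Java 8 or later.

```java
// Hypothetical helper; the class and method names are illustrative.
final class BmpFilter {

    // Returns the input without code points above U+FFFF and without
    // unpaired surrogates, neither of which encodes to 1-3 byte UTF-8.
    static String stripSupplementary(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> Character.isBmpCodePoint(cp) && !Character.isSurrogate((char) cp))
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 needs 4 bytes in UTF-8; written here as a surrogate-pair escape.
        String label = "look into my eyes \uD83D\uDE00.\uD83D\uDE00";
        System.out.println(stripSupplementary(label));
    }
}
```

Unlike the replacement approach, this drops the characters rather than substituting `?`, so the cleaned value could be passed to the `IN (...)` query as an ordinary 3-byte-safe string. Whether dropping or replacing is preferable depends on how the addresses are matched.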