Java: Readers and Encoding

encoding java io

2022-09-05 00:00:47

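Java's default encoding is ASCII. Yes? (See my edit below.)

When a text file is encoded in UTF-8, how does a Reader know that it has to use UTF-8?

The Readers I'm talking about are:

FileReaders
BufferedReaders from Sockets
A Scanner from System.in
...

Edit

It turns out the encoding depends on the OS, which means that the following is not true on every OS:

'a' == 97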

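Answer 1

How does a Reader know that it has to use UTF-8?

You normally specify that yourself in an InputStreamReader. It has a constructor that takes the character encoding. For example: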

Reader reader = new InputStreamReader(new FileInputStream("c:/foo.txt"), "UTF-8");

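All other readers (as far as I am aware) use the platform default character encoding, which may indeed not be the correct encoding (such as, -cough-, CP-1252).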
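For the readers named in the question, the same idea can be sketched as follows. This is a minimal illustration, not the only way to do it: the class and method names (`ExplicitCharsetReaders`, `utf8Reader`, `utf8Scanner`) are made up for the example, and a `ByteArrayInputStream` stands in for `socket.getInputStream()` or `System.in`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper class, for illustration only.
class ExplicitCharsetReaders {

    // Wraps any byte stream (for example socket.getInputStream()) in a
    // reader that decodes UTF-8 instead of the platform default.
    static BufferedReader utf8Reader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Scanner also accepts a charset name; System.in could be passed here.
    static Scanner utf8Scanner(InputStream in) {
        return new Scanner(in, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for bytes arriving from a socket, a file, or the console.
        byte[] data = "h\u00e9llo\n".getBytes(StandardCharsets.UTF_8);

        try (BufferedReader reader = utf8Reader(new ByteArrayInputStream(data))) {
            System.out.println(reader.readLine().length()); // prints 5
        }
        try (Scanner scanner = utf8Scanner(new ByteArrayInputStream(data))) {
            System.out.println(scanner.nextLine().length()); // prints 5
        }
    }
}
```

The point of the wrapper methods is only that the charset is stated once, at the place where bytes become characters; everything layered on top (BufferedReader, Scanner) then inherits it.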

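In theory you can also detect the character encoding automatically based on the byte order mark (BOM). This distinguishes the several Unicode encodings from other encodings. Unfortunately, Java SE doesn't have any API for this, but you can homebrew one which can be used as a replacement for InputStreamReader as in the example above: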

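import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {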
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks.
        // The UTF-32 checks come first: the UTF-32LE BOM (FF FE 00 00) starts
        // with the UTF-16LE BOM (FF FE), so testing UTF-16LE first would hide it.
        // The n >= 4 guard keeps a short read from matching the zero-filled buffer.
        if (n >= 4 && (bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if (n >= 4 && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // No BOM found: fall back to the caller's default and unread everything.
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}
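To see why skipping the BOM matters at all, here is a small sketch showing that Java's stock UTF-8 decoder leaves the BOM in the decoded text as a leading U+FEFF character. The class and method names (`BomDemo`, `decodeRaw`, `stripBom`) are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class BomDemo {

    // UTF-8 bytes for "hi", preceded by the UTF-8 byte order mark EF BB BF.
    static final byte[] WITH_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};

    // Decodes with the stock UTF-8 decoder, which leaves the BOM in place.
    static String decodeRaw(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Drops a leading U+FEFF if one is present.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        String raw = decodeRaw(WITH_BOM);
        System.out.println(raw.length());           // prints 3: U+FEFF, 'h', 'i'
        System.out.println(stripBom(raw).length()); // prints 2
    }
}
```

Stripping the character after decoding, as `stripBom` does, only works when the encoding is already known; a pushback-based reader like the one above is what you need when the BOM is also supposed to select the encoding.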

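Edit, as a reply to your edit:

So the encoding depends on the OS. That means the following is not true on every OS:
'a' == 97

No, this is not true. The ASCII encoding (which contains 128 characters, 0x00 through 0x7F) is the basis of virtually all other character encodings in common use. Only characters outside the ASCII charset risk being represented differently in another encoding. The ISO-8859 encodings cover the characters in the ASCII range with the same code points. The Unicode encodings cover the characters in the ISO-8859-1 range with the same code points.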
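A quick way to check this is to encode the same text with a few of the charsets every JVM ships with. The sketch below (the class name `AsciiCompatDemo` and the helper `encode` are made up for the example) also shows the other half of the point: a Java `char` is a UTF-16 code unit regardless of the platform default, so `'a' == 97` does not depend on the OS at all, while a non-ASCII character such as `é` does get different bytes in different encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class AsciiCompatDemo {

    // Encodes text into bytes using the given charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // A char comparison never consults the platform encoding.
        System.out.println('a' == 97); // prints true

        // 'a' is inside the ASCII range: the same single byte everywhere.
        System.out.println(Arrays.toString(encode("a", StandardCharsets.US_ASCII)));   // [97]
        System.out.println(Arrays.toString(encode("a", StandardCharsets.ISO_8859_1))); // [97]
        System.out.println(Arrays.toString(encode("a", StandardCharsets.UTF_8)));      // [97]

        // U+00E9 is outside ASCII: the bytes differ between encodings.
        System.out.println(Arrays.toString(encode("\u00e9", StandardCharsets.ISO_8859_1))); // [-23]
        System.out.println(Arrays.toString(encode("\u00e9", StandardCharsets.UTF_8)));      // [-61, -87]
    }
}
```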

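You may find each of these blog articles an interesting read:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (the more theoretical of the two)
Unicode - How to get the characters right? (the more practical of the two)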

Answer 2

Java's default encoding depends on your OS. For Windows, it's normally "windows-1252"; for Unix, it's typically "ISO-8859-1" or "UTF-8".
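If you want to see what your own JVM uses, you can ask it directly. The following is a small sketch with an invented class name (`DefaultCharsetDemo`); the output varies by platform and JDK version. Note that from JDK 18 onward (JEP 400), the standard APIs default to UTF-8 regardless of OS, so the OS-dependent behaviour described above applies to older releases.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, for illustration only.
class DefaultCharsetDemo {

    // Returns the canonical name of the charset that readers fall back to
    // when no encoding is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Output depends on the OS, locale settings, and JDK version.
        System.out.println(defaultCharsetName());
        System.out.println(System.getProperty("file.encoding"));
    }
}
```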

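A Reader knows the correct encoding because you tell it the correct encoding. Unfortunately, not all readers let you do this (for example, FileReader doesn't before Java 11), so often you have to use an InputStreamReader.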
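For files specifically, `java.nio.file.Files.newBufferedReader(Path, Charset)` is one way to state the encoding without building the InputStreamReader chain by hand. The sketch below assumes a throwaway temp file; the class and method names (`Utf8FileRead`, `writeSample`, `firstLine`) are hypothetical. On Java 11 and later, `FileReader` also has constructors that take a `Charset`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, for illustration only.
class Utf8FileRead {

    // Writes the text as UTF-8 into a fresh temp file and returns its path.
    static Path writeSample(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the first line, decoding the file explicitly as UTF-8.
    static String firstLine(Path file) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = writeSample("gr\u00fc\u00dfe\n");
        System.out.println(firstLine(file).length()); // prints 5
        Files.deleteIfExists(file);
    }
}
```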