An invalid XML character (Unicode: 0xc) was found

2022-09-01 00:30:35

Parsing an XML file with the Java DOM parser results in:

[Fatal Error] os__flag_8c.xml:103:135: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

Answer 1

Some characters are not allowed in XML documents, even when the data is wrapped in a CDATA block.

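If you generated the document, you need to entity-encode those characters or strip them out. If you were handed a malformed document, strip the characters before you try to parse it.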
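As a rough illustration of stripping before parsing, here is a minimal sketch. It assumes the whole document fits in memory as a String and that it targets XML 1.0. The class and method names (XmlSanitizer, sanitize, parseSanitized) are made up for this example and are not part of any library.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: removes characters that XML 1.0 forbids, then parses.
class XmlSanitizer {
    // Matches any code point outside the XML 1.0 "Char" production.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    static String sanitize(String rawXml) {
        return INVALID_XML_CHARS.matcher(rawXml).replaceAll("");
    }

    static Document parseSanitized(String rawXml) throws Exception {
        // Parsing from a character stream avoids re-encoding the cleaned text.
        InputSource source = new InputSource(new StringReader(sanitize(rawXml)));
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
    }

    public static void main(String[] args) throws Exception {
        // "\f" is the form feed character (0xC) from the error message above.
        String broken = "<root>page one\fpage two</root>";
        Document doc = parseSanitized(broken);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Stripping silently discards data, so if the forbidden characters carry meaning you may prefer to replace them with a visible placeholder instead of an empty string.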

See dolmen's answer in this thread: Invalid Characters in XML

He links to this article: http://www.w3.org/TR/xml/#charsets

Basically, no character below 0x20 is allowed, except 0x9 (TAB), 0xA (LF), and 0xD (CR).
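To find out where a document breaks this rule, the allowed ranges from the linked section can be turned into a small check. This is only a sketch for XML 1.0. XmlCharCheck, isValidXml10Char, and firstInvalidIndex are names invented for the example.

```java
// Hypothetical helper: tests code points against the XML 1.0 "Char" production.
class XmlCharCheck {
    static boolean isValidXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the char index of the first forbidden character, or -1 if none.
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!isValidXml10Char(codePoint)) {
                return i;
            }
            i += Character.charCount(codePoint); // skip both halves of a surrogate pair
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(isValidXml10Char(0xC));        // false: form feed
        System.out.println(isValidXml10Char(0x9));        // true: TAB
        System.out.println(firstInvalidIndex("ab\fcd"));  // 2
    }
}
```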


Answer 2
public String stripNonValidXMLCharacters(String in) {
    if (in == null || in.isEmpty()) return ""; // Nothing to strip.
    StringBuilder out = new StringBuilder(in.length()); // Holds the output.

    // Iterate by code point, not by char: a Java char is only 16 bits, so a
    // single char can never fall in 0x10000-0x10FFFF, and a char-by-char loop
    // would strip valid supplementary characters stored as surrogate pairs.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i); // The current code point.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.appendCodePoint(current);
        i += Character.charCount(current); // Advance by 1 or 2 chars.
    }
    return out.toString();
}