Producing valid XML with Java and UTF-8 encoding
I am using JAXP to generate and parse an XML document; some of its fields are loaded from a database.
Code to serialize the XML:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("test");
root.setAttribute("version", text);
doc.appendChild(root);
DOMSource domSource = new DOMSource(doc);
TransformerFactory tFactory = TransformerFactory.newInstance();
FileWriter out = new FileWriter("test.xml");
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(domSource, new StreamResult(out));
Code to parse the XML:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("test.xml");
I get the following exception:
[Fatal Error] test.xml:1:4: Invalid byte 1 of 1-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at com.test.Test.xml(Test.java:27)
at com.test.Test.main(Test.java:55)
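The string text contains a u-umlaut and an o-umlaut (ü and ö, character codes 0xFC and 0xF6). These are the characters that cause the error. When I escape the string myself so that ü and ö are not written as raw characters, the problem goes away. Other entities are encoded automatically when I write out the XML.
How do I write and read the output correctly without replacing these characters myself?
(I have already read the following questions: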
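To make the byte values concrete, here is a small sketch that prints how ü and ö are encoded in ISO-8859-1 and in UTF-8. The class name UmlautBytes and the helper hex are made up for illustration. The single bytes 0xFC and 0xF6 match the character codes mentioned above, which would be consistent with the file having been written in a single-byte encoding rather than UTF-8, though that is an assumption and not something the stack trace proves.

```java
import java.nio.charset.StandardCharsets;

public class UmlautBytes {
    // Hypothetical helper: format bytes as space-separated upper-case hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00FC\u00F6"; // ü and ö
        // In ISO-8859-1 each character is one byte equal to its code point.
        System.out.println("ISO-8859-1: " + hex(text.getBytes(StandardCharsets.ISO_8859_1)));
        // In UTF-8 each of these characters takes two bytes.
        System.out.println("UTF-8:      " + hex(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A lone 0xFC byte is not a valid UTF-8 sequence, so a parser that has been told the file is UTF-8 would be expected to reject it.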
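One thing worth checking, offered as a sketch and not a confirmed diagnosis: FileWriter encodes characters with the platform default charset, so the OutputKeys.ENCODING property only affects the XML declaration when the transformer is handed a Writer. The variant below passes the transformer an OutputStream so that it does the byte encoding itself. The class name Utf8RoundTrip, the method names write and readVersion, and the temp-file location are assumptions made for the example.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Utf8RoundTrip {
    // Serialize <test version="..."/> to a byte stream; the transformer
    // applies the UTF-8 encoding requested via OutputKeys.ENCODING.
    public static void write(File file, String text) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("test");
        root.setAttribute("version", text);
        doc.appendChild(root);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // OutputStream instead of FileWriter: no platform-default charset involved.
        try (OutputStream out = new FileOutputStream(file)) {
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    // Parse the file and return the root element's "version" attribute.
    public static String readVersion(File file) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(file);
        return doc.getDocumentElement().getAttribute("version");
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("test", ".xml");
        file.deleteOnExit();
        String text = "\u00FC\u00F6"; // ü and ö, unescaped
        write(file, text);
        System.out.println("round trip ok: " + text.equals(readVersion(file)));
    }
}
```

If a Writer is required for some reason, wrapping the stream in an OutputStreamWriter with an explicit UTF-8 charset should have the same effect as far as the bytes on disk are concerned.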