How do I unescape HTML character entities in Java?

string html eclipse java decode

2022-08-31 07:00:21

Basically, I want to decode a given HTML document and replace all special characters, for example "&nbsp;" -> " " and "&gt;" -> ">".

In .NET we can use HttpUtility.HtmlDecode.

What is the equivalent function in Java?

Answer 1

I have used Apache Commons StringEscapeUtils.unescapeHtml4() for this. From its documentation:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.
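To make concrete what "unescaping" means here, the following is a minimal, standard-library-only sketch of the idea. It is not the Apache Commons implementation: the class name `EntityDecoder` and its tiny entity table are hypothetical and deliberately incomplete, whereas a real library ships the full named-entity tables. In practice the single call `StringEscapeUtils.unescapeHtml4(html)` mentioned above is the better choice.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Apache Commons or Jsoup.
public class EntityDecoder {
    // Deliberately small table of named entities (assumption: a real
    // library covers the complete HTML 4 entity set).
    private static final Map<String, String> NAMED = Map.of(
            "nbsp", "\u00A0",
            "lt", "<",
            "gt", ">",
            "amp", "&",
            "quot", "\"",
            "apos", "'");

    // Matches named (&gt;), decimal (&#8211;) and hexadecimal (&#x41;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);");

    public static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            // Unknown or invalid references are left exactly as they were.
            String replacement = body.startsWith("#")
                    ? decodeNumeric(body.substring(1), m.group())
                    : NAMED.getOrDefault(body, m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String decodeNumeric(String digits, String original) {
        try {
            boolean hex = digits.charAt(0) == 'x' || digits.charAt(0) == 'X';
            int codePoint = hex
                    ? Integer.parseInt(digits.substring(1), 16)
                    : Integer.parseInt(digits, 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : original;
        } catch (NumberFormatException e) {
            return original; // value too large to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("1 &lt; 2 &amp;&amp; 3 &gt; 2"));
        System.out.println(unescape("Smith &#8211; &#x41;"));
    }
}
```

The input is scanned once, so an already-escaped ampersand such as `&amp;lt;` decodes to the literal text `&lt;` rather than being decoded twice. This sketch requires Java 9 or later because of `Map.of` and the `StringBuilder` overload of `appendReplacement`.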

Answer 2

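The libraries mentioned in the other answers are fine solutions, but if you already happen to be digging through real-world HTML in your project, the Jsoup project has a lot more to offer than just handling "ampersand pound FFFF semicolon" things.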

// textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
// becomes this: This is a sample. "Granny" Smith –.
// with one line of code:
// Jsoup.parse(textValue).getText(); // for older versions of Jsoup
String plainText = Jsoup.parse(textValue).text();

// Another possibility may be the static unescapeEntities method:
boolean strictMode = true;
String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);

You also get a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods. It is open source and MIT licensed.