Jsoup.clean without adding HTML entities

html java html-entities jsoup

2022-09-01 14:00:01

I am cleaning unwanted HTML tags (such as <script>) out of some text with:

String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());

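The problem is that it replaces, for example, å with &aring; (which causes trouble for me, since the result is not "pure XML").

For example,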

Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())

yields

"hello &aring;  world"

but I would like

"hello å  world"

Is there a simple way to achieve this? (That is, simpler than converting &aring; back to å in the result.)

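Answer 1

You can configure Jsoup's escape mode: using EscapeMode.xhtml will give you output without entities.

Here is a complete snippet that accepts str as input and cleans it using Whitelist.simpleText():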

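// Imports needed by this snippet
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

// Parse str into a Document
Document doc = Jsoup.parse(str);

// Clean the document.
doc = new Cleaner(Whitelist.simpleText()).clean(doc);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string of the body.
str = doc.body().html();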

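Answer 2

There are already feature requests for this on the Jsoup website. You can extend the source code yourself by adding a new empty Map and a new escape type. If you don't want to do that, you can use StringEscapeUtils from Apache Commons.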

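// Assumes Commons Lang 2.x (org.apache.commons.lang.StringEscapeUtils);
// in Commons Lang 3 the equivalent method is unescapeHtml4.
public static String getTextOnlyFromHtmlText(String htmlText) {
    Document doc = Jsoup.parse(htmlText);
    doc.outputSettings().charset("UTF-8");
    // Strip everything except the simple text-formatting tags.
    String cleaned = Jsoup.clean(doc.body().html(), Whitelist.simpleText());
    // Turn entities such as &aring; back into the characters they stand for.
    return StringEscapeUtils.unescapeHtml(cleaned);
}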
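
If pulling in Apache Commons is not an option, the unescape step above can be approximated by hand. The following is only a minimal sketch using the standard library (Java 9 or later), not what StringEscapeUtils does internally: the class name EntityUnescapeSketch, the unescape helper, and the four-entry entity table are all hypothetical and would need extending for real input. It deliberately leaves &lt;, &gt; and &amp; escaped, so the result can still be embedded in XML.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityUnescapeSketch {
    // Hypothetical, deliberately tiny table; a real library knows the full
    // HTML entity set. Unicode escapes avoid source-encoding surprises.
    private static final Map<String, String> NAMED = Map.of(
            "aring", "\u00E5",  // å
            "Aring", "\u00C5",  // Å
            "auml", "\u00E4",   // ä
            "ouml", "\u00F6");  // ö

    // Matches named references of the form &name;
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces the entities listed in NAMED; anything else, including
    // &lt;, &gt; and &amp;, is copied through unchanged.
    static String unescape(String text) {
        return ENTITY.matcher(text).replaceAll(match ->
                Matcher.quoteReplacement(
                        NAMED.getOrDefault(match.group(1), match.group())));
    }

    public static void main(String[] args) {
        System.out.println(unescape("hello &aring;  world"));
    }
}
```

Keeping the markup-significant entities escaped is a design choice for this use case: unescaping everything would turn text such as &lt;script&gt; back into a literal tag after it has already been cleaned.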