Get the visible text of a page

selenium-webdriver java

2022-09-01 01:31:41

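How can I get the visible text of a web page, without the HTML tags, using Selenium WebDriver?

I need something equivalent to the HtmlPage.asText() function in HtmlUnit.

Fetching the markup with WebDriver.getPageSource() and parsing it with jsoup is not enough, because the page may contain elements that are hidden (via external CSS), and I am not interested in those.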

Answer 1

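Call driver.findElement(By.tagName("body")) (or use some other selector that picks the top-level element), then call getText() on that element. It returns all of the visible text.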
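A minimal Java sketch of this approach follows. It assumes the Selenium Java bindings are on the classpath and that Chrome with a matching driver is installed; the class name VisibleText and the URL are placeholders, not part of the original answer.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

class VisibleText {
    public static void main(String[] args) {
        // Assumption: Chrome and a compatible driver are available locally.
        WebDriver driver = new ChromeDriver();
        try {
            // Placeholder URL; substitute the page you want to read.
            driver.get("https://example.com");

            // Select the top-level element and ask for its rendered text.
            WebElement body = driver.findElement(By.tagName("body"));
            String visibleText = body.getText();

            System.out.println(visibleText);
        } finally {
            // Always close the browser, even if the lookup fails.
            driver.quit();
        }
    }
}
```

getText() is documented to return only text that is not hidden by CSS, which is why it differs from parsing the raw page source.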

Answer 2

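I can help you with Selenium in C#.

With the code below you can grab all of the text on a given page and save it to a text file in a location of your choice.

Make sure you have these using directives: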

using System;
using System.IO;
using System.Text;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

Once you have navigated to the page in question, try this code.

// Read the rendered text of the <body> element.
IWebElement body = driver.FindElement(By.TagName("body"));
var result = body.Text;

// Folder location: one folder per day. The fixed date format avoids the
// slashes that ToShortDateString() produces in many cultures.
var dir = Path.Combine(@"C:\Textfile", DateTime.Now.ToString("yyyy-MM-dd"));

// If the folder doesn't exist, create it
if (!Directory.Exists(dir))
    Directory.CreateDirectory(dir);

// Appends the page text to Copiedtext.txt, creating the file if it does not exist.
File.AppendAllText(Path.Combine(dir, "Copiedtext.txt"), result);
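Since the question is tagged java, the file-saving step could be sketched in Java roughly as follows. The class PageTextSaver and its save method are hypothetical names introduced for illustration; the text argument is assumed to be the string returned by getText() on the body element.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.LocalDate;

class PageTextSaver {
    /**
     * Appends text to Copiedtext.txt inside a per-day folder under baseDir
     * and returns the path of the file that was written.
     */
    static Path save(Path baseDir, String text) throws IOException {
        // LocalDate.toString() yields an ISO date such as 2022-09-01,
        // which is safe to use as a folder name.
        Path dir = baseDir.resolve(LocalDate.now().toString());

        // Creates the folder and any missing parents; does nothing if it exists.
        Files.createDirectories(dir);

        Path file = dir.resolve("Copiedtext.txt");

        // CREATE + APPEND: create the file if needed, otherwise add to the end.
        Files.write(file, text.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return file;
    }
}
```

A call such as PageTextSaver.save(Paths.get("C:\\Textfile"), body.getText()) would then mirror the C# snippet, assuming body is the element located earlier.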