I have a web page that contains the following HTML:
....
<div class="some_class">Text I want
<span class="another_class">Text I don't want</span>
....more junk...
....a lot more junk....
</div>
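I tried calling getText(), but it returns all of the text inside the div, including a lot of other text I don't want. My question is: how can I get only the text I want, without resorting to some kind of parsing or substring extraction?
Thanks!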
Answer 0 (score: 0)
You need to use JavaScript for this. If your WebDriver (Firefox, Chrome, etc.) supports executing JavaScript, you can do it like this:
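import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class Main
{
    // Script run in the browser: concatenates only the direct text-node
    // children of the element passed in as arguments[0].
    //@formatter:off
    final static String JS_SCRIPT_GET_TEXT = "var element = arguments[0]; " +
        "var text = ''; " +
        "for (var i = 0; i < element.childNodes.length; i++) " +
        "  if (element.childNodes[i].nodeType === Node.TEXT_NODE) " +
        "  { " +
        "    text += element.childNodes[i].textContent + ' '; " +
        "  } " +
        "return text; ";
    //@formatter:on

    public static void main(final String[] args)
    {
        final WebDriver driver = new FirefoxDriver();
        try
        {
            driver.get("http://en.wikipedia.org/wiki/HTML");
            // findElements(By.cssSelector(..)) is used in place of findElementsByCssSelector(..),
            // which newer Selenium releases no longer provide.
            final List<WebElement> elements = driver.findElements(By.cssSelector("#mw-content-text div"));
            final WebElement webElement = elements.get(0);
            final String extractInnerText = extractInnerText(webElement, driver);
            System.out.println("---------------------");
            System.out.println("Seleniums .getText():\n" + webElement.getText());
            System.out.println("\n\n---------------------");
            System.out.println("Just the node text:\n" + extractInnerText);
        }
        finally
        {
            // Always close the browser, even if a lookup above fails.
            driver.quit();
        }
    }

    // Returns the element's own text, excluding the text of any child elements.
    public static String extractInnerText(final WebElement webElement, final WebDriver webDriver)
    {
        // The driver must implement JavascriptExecutor for this cast to succeed.
        final JavascriptExecutor javascriptExecutor = (JavascriptExecutor) webDriver;
        final String webElementText = (String) javascriptExecutor.executeScript(JS_SCRIPT_GET_TEXT, webElement);
        return webElementText.trim();
    }
}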
For this sample HTML:
<div class="dablink">
For the use of HTML on Wikipedia, see
<a href="/wiki/Help:HTML_in_wikitext" title="Help:HTML in wikitext">Help:HTML in wikitext</a>
.
</div>
it prints:
---------------------
Seleniums .getText():
For the use of HTML on Wikipedia, see Help:HTML in wikitext.
---------------------
Just the node text:
For the use of HTML on Wikipedia, see .
I think that is what you need. You can apply the extractInnerText(..) method to any WebElement.
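To see the filtering logic in isolation, without a browser, here is a minimal sketch of the same idea using the JDK's built-in DOM parser: walk the element's child nodes and keep only the ones whose type is TEXT_NODE. The class name DirectTextDemo and the method directText are made up for this illustration, and it assumes the markup is well-formed XML, which real web pages often are not. It is meant to show what the JavaScript above does, not to replace it.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Selenium.
public class DirectTextDemo
{
    // Returns only the text held directly by the root element,
    // skipping the text of nested elements such as <span>.
    // Assumes wellFormedMarkup parses as XML.
    public static String directText(final String wellFormedMarkup)
    {
        final Element root;
        try
        {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)))
                    .getDocumentElement();
        }
        catch (final Exception e)
        {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }

        final StringBuilder text = new StringBuilder();
        final NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++)
        {
            final Node child = children.item(i);
            // Same test as nodeType === Node.TEXT_NODE in the browser script.
            if (child.getNodeType() == Node.TEXT_NODE)
            {
                text.append(child.getNodeValue()).append(' ');
            }
        }
        return text.toString().trim();
    }

    public static void main(final String[] args)
    {
        final String markup = "<div class=\"some_class\">Text I want"
                + "<span class=\"another_class\">Text I don't want</span></div>";
        System.out.println(directText(markup)); // prints: Text I want
    }
}
```

With the div from the question as input, the span's text is dropped because the span is an element node, not a text node. Against a live page you would still go through the JavascriptExecutor approach, since the browser has already parsed the (possibly malformed) HTML into a DOM for you.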