How do I select HTML by a partial class name with jsoup?

Asked: 2015-05-04 18:40:32

Tags: html-parsing jsoup

I am trying to get a piece of HTML, for example:

<tr class="myclass-1234" rel="5678">
    <td class="lst top">foo 1</td>
    <td class="lst top">foo 2</td>
    <td class="lst top">foo-5678</td>
    <td class="lst top nw" style="text-align:right;">
        <span class="nw">1.00</span> foo
    </td>
    <td class="top">01.05.2015</td>
</tr>

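I am completely new to jsoup. My first thought was to select the row by its class name, but the number 1234 is generated dynamically. Is there a way to select it by part of the class name, or is there a better approach?

2 answers:

Answer 0: (score: 0)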

doc.select("tr[class~=myclass.*]");

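This selects every tr whose class attribute matches the regular expression myclass.*. Note that the match is not anchored: jsoup looks for the pattern anywhere in the attribute value, so a class such as not-myclass-1 also matches. To require that the value starts with myclass, anchor the pattern, for example tr[class~=^myclass].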
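The difference between the unanchored and the anchored pattern can be checked with plain java.util.regex, without jsoup. This is only a minimal sketch of the regex test applied to the raw attribute string; the class ClassRegexDemo and its method names are hypothetical and not part of jsoup.

```java
import java.util.regex.Pattern;

// Plain-Java model of the regex test on a class attribute value.
// ClassRegexDemo and its methods are hypothetical, for illustration only.
final class ClassRegexDemo {
    // Unanchored: succeeds if "myclass" occurs anywhere in the value.
    static boolean unanchored(String classAttr) {
        return Pattern.compile("myclass.*").matcher(classAttr).find();
    }

    // Anchored: succeeds only if the value begins with "myclass".
    static boolean anchored(String classAttr) {
        return Pattern.compile("^myclass").matcher(classAttr).find();
    }

    public static void main(String[] args) {
        System.out.println(unanchored("myclass-1234"));     // true
        System.out.println(unanchored("not-myclass-1234")); // true
        System.out.println(anchored("myclass-1234"));       // true
        System.out.println(anchored("not-myclass-1234"));   // false
    }
}
```

If only rows like the one in the question should match, the anchored form is the safer choice.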

Answer 1: (score: 0)

Assuming a simple HTML document containing two tr elements, only one of which has the class you mentioned, the code below shows how to get that tr with a CSS selector.

The CSS selector tr[class^=myclass] selects all elements of type "tr" whose class attribute starts (^) with myclass:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Example {
  public static void main(String[] args) {
    String html = "<html><body><table><tr class=\"myclass-1234\" rel=\"5678\">"
      + "<td class=\"lst top\">foo 1</td>"
      + "<td class=\"lst top\">foo 2</td>"
      + "<td class=\"lst top\">foo-5678</td>"
      + "<td class=\"lst top nw\" style=\"text-align:right;\">"
      + "<span class=\"nw\">1.00</span> foo"
      + "</td>"
      + "<td class=\"top\">01.05.2015</td>"
      + "</tr><tr><td>Not to be selected</td></tr></table></body></html>";

    Document doc = Jsoup.parse(html);
    Elements selectAllTr = doc.select("tr");
    // Should be 2
    System.out.println("tr elements in html: " + selectAllTr.size());

    Elements trWithStartingClassMyClass = doc.select("tr[class^=myclass]");
    // Should be 1
    System.out.println("tr elements with class \"myclass*\" in html: " + trWithStartingClassMyClass.size());
    System.out.println(trWithStartingClassMyClass);

  }

}
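One caveat: an attribute-prefix selector such as tr[class^=myclass] is compared against the whole class attribute string, not against each class name in it. The row in the question has a single class, so the prefix test works there. If a row carried several classes and myclass-1234 were not the first one, the prefix test would miss it, and the substring form tr[class*=myclass] would be the looser alternative. The sketch below models the two tests with plain string operations; AttrMatchDemo and its methods are hypothetical and are not jsoup API.

```java
// Plain-Java model of the attribute-prefix and attribute-substring tests.
// AttrMatchDemo and its methods are hypothetical, for illustration only.
final class AttrMatchDemo {
    // Models [class^=prefix]: the whole attribute string must begin with the prefix.
    static boolean startsWithPrefix(String classAttr, String prefix) {
        return classAttr.startsWith(prefix);
    }

    // Models [class*=part]: the substring may appear anywhere in the attribute string.
    static boolean containsPart(String classAttr, String part) {
        return classAttr.contains(part);
    }

    public static void main(String[] args) {
        String single = "myclass-1234";       // as in the question
        String multiple = "odd myclass-1234"; // assumed variant with an extra leading class

        System.out.println(startsWithPrefix(single, "myclass"));   // true
        System.out.println(startsWithPrefix(multiple, "myclass")); // false
        System.out.println(containsPart(multiple, "myclass"));     // true
    }
}
```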