How do I parse multiple HTML elements with jsoup?

Time: 2017-12-13 20:17:57

Tags: java html jsoup

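I am trying to parse an HTML document saved in my Java project. It contains a simple HTML table of crime statistics for a Garda station (the Garda are the Irish police). At the moment I am trying to parse the contents of the document and print them to the console. The problem is that I can only print the figures from the table (not including the years), whereas what I want is each crime name from the table followed by its 6 figures.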

HTML TABLE

<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Recorded Crime Offences (Number) by Garda Station, Type of Offence and&lt;BR&gt;
Year</title>
</head>
<body>
<table border="">
<tbody><tr align="LEFT">
<th colspan="8">Recorded Crime Offences (Number) by Garda Station, Type of Offence and<br>
Year</th>
</tr>
<tr align="LEFT">
<th colspan="2"> </th>
<th valign="TOP" colspan="1">2011</th>
<th valign="TOP" colspan="1">2012</th>
<th valign="TOP" colspan="1">2013</th>
<th valign="TOP" colspan="1">2014</th>
<th valign="TOP" colspan="1">2015</th>
<th valign="TOP" colspan="1">2016</th>
</tr>
<tr align="RIGHT">
<th align="LEFT" valign="TOP" rowspan="12">Balbriggan, D.M.R. Northern Division</th>
<th align="LEFT">03 ,Attempts/threats to murder, assaults, harassments and related offences</th>
<td>96</td>
<td>89</td>
<td>70</td>
<td>97</td>
<td>103</td>
<td>103</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">04 ,Dangerous or negligent acts</th>
<td>72</td>
<td>67</td>
<td>50</td>
<td>53</td>
<td>45</td>
<td>43</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">05 ,Kidnapping and related offences</th>
<td>0</td>
<td>0</td>
<td>1</td>
<td>3</td>
<td>0</td>
<td>7</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">06 ,Robbery, extortion and hijacking offences</th>
<td>16</td>
<td>19</td>
<td>16</td>
<td>7</td>
<td>11</td>
<td>13</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">07 ,Burglary and related offences</th>
<td>177</td>
<td>190</td>
<td>157</td>
<td>140</td>
<td>151</td>
<td>139</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">08 ,Theft and related offences</th>
<td>510</td>
<td>466</td>
<td>495</td>
<td>542</td>
<td>445</td>
<td>302</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">09 ,Fraud, deception and related offences</th>
<td>66</td>
<td>76</td>
<td>126</td>
<td>114</td>
<td>98</td>
<td>66</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">10 ,Controlled drug offences</th>
<td>113</td>
<td>100</td>
<td>64</td>
<td>55</td>
<td>44</td>
<td>80</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">11 ,Weapons and Explosives Offences</th>
<td>22</td>
<td>18</td>
<td>13</td>
<td>10</td>
<td>19</td>
<td>17</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">12 ,Damage to property and to the environment</th>
<td>257</td>
<td>266</td>
<td>269</td>
<td>203</td>
<td>213</td>
<td>177</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">13 ,Public order and other social code offences</th>
<td>168</td>
<td>115</td>
<td>93</td>
<td>78</td>
<td>79</td>
<td>92</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">15 ,Offences against government, justice procedures and organisation of crime</th>
<td>45</td>
<td>48</td>
<td>39</td>
<td>39</td>
<td>66</td>
<td>50</td>
</tr>
<tr align="LEFT">
<td colspan="8"><a href="http://www.cso.ie/en/methods/crime/recordedcrime/">See Background Notes</a> 
</td>
</tr>
</tbody></table>

</body></html>

The code I have so far prints the figures like this:

Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
... (Figures 11-66 omitted for conciseness)
Figure 67 : 48
Figure 68 : 39
Figure 69 : 39
Figure 70 : 66
Figure 71 : 50

However, I would like the output to look more like this:

Crime: 03 ,Attempts/threats to murder, assaults, harassments and related offences
Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103

Crime: 04 ,Dangerous or negligent acts
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
etc, etc

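I have tried a number of different approaches, such as adding one for loop to access the th elements containing the crimes and another to access the td elements containing the figures, but this usually results in an error like: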

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0  
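
The failing loop is not shown, so the following is only a guess at one plausible cause: calling select("th") on a td element searches inside that cell, finds nothing, and get(0) on the empty result then throws. The sketch below uses a hypothetical one-row table and a hypothetical helper, firstHeaderText, to show the difference between searching from a cell and searching from its row.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class EmptySelectionDemo {

    // Hypothetical helper: returns the text of the first <th> found inside
    // the given element, or null when the selection is empty, instead of
    // calling get(0) blindly.
    static String firstHeaderText(Element scope) {
        Elements headers = scope.select("th");
        return headers.isEmpty() ? null : headers.get(0).text();
    }

    public static void main(String[] args) {
        // Made-up one-row table in the same shape as the rows above.
        String html = "<table><tr><th>04 ,Dangerous or negligent acts</th><td>72</td></tr></table>";
        Document doc = Jsoup.parse(html);

        Element cell = doc.select("td").first(); // a <td> has no <th> descendants
        Element row = doc.select("tr").first();  // the enclosing row does

        System.out.println(cell.select("th").size()); // 0
        System.out.println(firstHeaderText(cell));    // null
        System.out.println(firstHeaderText(row));     // 04 ,Dangerous or negligent acts
    }
}
```

If this is what happened, searching from the tr element rather than from each td avoids the empty selection.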

Working parser class

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseCrimeStatistics {

    public static void main(String[] args) {
        try {
            int count = 0;
            File input = new File("Balbriggan.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");

            Elements title = doc.select("td");

            for (Element sectd1 : title) {
                Elements ths = sectd1.select("td");

                String result = ths.get(0).text();

                System.out.println("Figure " + count + " : " + result);

                count++;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Would anyone have any suggestions on how I could solve this? Thanks.

1 Answer:

Answer 0: (score: 2)

Try this:

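int count = 0;
File input = new File("Balbriggan.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");

Elements numbers = doc.select("td");
Elements titles = doc.select("th");

// The first 9 th elements (indices 0-8) are the table title, the blank corner
// cell, the six year headers and the station name, so crime names start at index 9.
for (int i = 9; i < titles.size(); i++) {
    System.out.println("Crime: " + titles.get(i).text());
    // Each crime row holds 6 td figures, one per year from 2011 to 2016.
    for (int j = 0; j < 6; j++) {
        System.out.println("Figure " + count + " : " + numbers.get((i - 9) * 6 + j).text());
        count++;
    }
}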
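
The index arithmetic above depends on this exact table layout (9 leading th elements, 6 figures per row). As an alternative, here is a minimal sketch that walks the table row by row instead. It is not part of the original answer. The class name CrimeRowsSketch, the method parse, and the trimmed inline HTML are made up for illustration. It assumes a data row is any tr that contains both th and td cells, and that the crime name is the last th in that row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CrimeRowsSketch {

    // Maps each crime label to its figures, in document order.
    // Assumption: crime labels are unique, otherwise later rows overwrite earlier ones.
    static Map<String, List<String>> parse(Document doc) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Element row : doc.select("tr")) {
            Elements headers = row.select("th");
            Elements cells = row.select("td");
            // Skip the title and year rows (th only) and the footer row (td only).
            if (headers.isEmpty() || cells.isEmpty()) {
                continue;
            }
            // The first data row also carries the station name, so take the last th.
            String crime = headers.last().text();
            List<String> figures = new ArrayList<>();
            for (Element cell : cells) {
                figures.add(cell.text());
            }
            result.put(crime, figures);
        }
        return result;
    }

    public static void main(String[] args) {
        // Trimmed stand-in for Balbriggan.html with two years and two crimes.
        String html = "<table>"
                + "<tr><th colspan=\"2\"> </th><th>2011</th><th>2012</th></tr>"
                + "<tr><th rowspan=\"2\">Balbriggan, D.M.R. Northern Division</th>"
                + "<th>03 ,Attempts/threats to murder, assaults, harassments and related offences</th>"
                + "<td>96</td><td>89</td></tr>"
                + "<tr><th>04 ,Dangerous or negligent acts</th><td>72</td><td>67</td></tr>"
                + "<tr><td colspan=\"3\">See Background Notes</td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);

        int count = 0;
        for (Map.Entry<String, List<String>> entry : parse(doc).entrySet()) {
            System.out.println("Crime: " + entry.getKey());
            for (String figure : entry.getValue()) {
                System.out.println("Figure " + count + " : " + figure);
                count++;
            }
            System.out.println();
        }
    }
}
```

To run it against the real file, the string-based Jsoup.parse call can be swapped for the File-based call used above. Because each row is handled on its own, the loop should keep working if years or crime categories are added to the table, as long as the row structure stays the same.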