How do I parse <br> tags inside a table with HtmlAgilityPack?

Asked: 2015-08-18 03:19:48

Tags: vb.net .net-4.0 html-parsing html-agility-pack

I have an HTML table whose cell values are separated by <br> tags.

<TABLE class=a12 cellSpacing=0 cols=8 cellPadding=0 border=1>
<TBODY>
    <TR>
        <TD style="WIDTH: 20.32mm"></TD>
        <TD style="WIDTH: 34mm"></TD>
        <TD style="WIDTH: 34mm"></TD>
        <TD style="WIDTH: 34mm"></TD>
        <TD style="WIDTH: 34mm"></TD>
        <TD style="WIDTH: 34mm"></TD>
        <TD style="WIDTH: 34mm"></TD>
        <TD style="WIDTH: 34mm"></TD>
    </TR>
    <TR style="HEIGHT: 5.08mm">
        <TD class=a23><DIV class=r11>Hrs</DIV></TD>
        <TD class=a24><DIV class=r11>MON</DIV></TD>
        <TD class=a25><DIV class=r11>TUE</DIV></TD>
        <TD class=a26><DIV class=r11>WED</DIV></TD>
        <TD class=a27><DIV class=r11>THU</DIV></TD>
        <TD class=a28><DIV class=r11>FRI</DIV></TD>
        <TD class=a29><DIV class=r11>SAT</DIV></TD>
        <TD class=a30><DIV class=r11>SUN</DIV></TD>
    </TR>
    <TR style="HEIGHT: 14.7mm">
        <TD class=a59><DIV class=r11>00:00</DIV></TD>
        <TD class=a60><DIV class=r11>FGH<BR>BM</DIV></TD>
        <TD class=a61><DIV class=r11>RFG8<BR>MFT5</DIV></TD>
        <TD class=a62><DIV class=r11>V5B6<BR>FG</DIV></TD>
        <TD class=a63><DIV class=r11>VB2N<BR>BN</DIV></TD>
        <TD class=a64><DIV class=r11>DFG21</DIV></TD>
        <TD class=a65><DIV class=r11>FGH<BR>MD20<BR>DHB0</DIV></TD>
        <TD class=a66><DIV class=r11>FD6<BR>HT7H4</DIV></TD>
    </TR>
    <TR style="HEIGHT: 14.7mm">
        <TD class=a59><DIV class=r11>02:00</DIV></TD>
        <TD class=a60><DIV class=r11>VN</DIV></TD>
        <TD class=a61><DIV class=r11>RTY<BR>MHF</DIV></TD>
        <TD class=a62><DIV class=r11>V5B6<BR>FG</DIV></TD>
        <TD class=a63><DIV class=r11>ZXC<BR>FHF</DIV></TD>
        <TD class=a64><DIV class=r11>DFG21<BR>GH<BR>PKJK</DIV></TD>
        <TD class=a65><DIV class=r11>FGH<BR>MD20</DIV></TD>
        <TD class=a66><DIV class=r11>FFG<BR>HFG4</DIV></TD>
    </TR>
    <TR style="HEIGHT: 14.7mm">
        <TD class=a59><DIV class=r11>04:00</DIV></TD>
        <TD class=a60><DIV class=r11>VNFG</DIV></TD>
        <TD class=a61><DIV class=r11>RTY<BR>MHF<br>T54</DIV></TD>
        <TD class=a62><DIV class=r11>CNFG</DIV></TD>
        <TD class=a63><DIV class=r11>QFCF<BR>FHF</DIV></TD>
        <TD class=a64><DIV class=r11>DFG21<BR>GH67</DIV></TD>
        <TD class=a65><DIV class=r11>SDF<BR>DFH</DIV></TD>
        <TD class=a66><DIV class=r11>CXV<BR>HFG4</DIV></TD>
    </TR>
</TBODY>
</TABLE>

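I tried converting the HTML table to a DataTable, but the values inside each cell come out concatenated.

How can I handle the <br> tags so that the values in a cell are separated by commas instead of being run together?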

Imports System.Collections.Generic
Imports System.Data
Imports System.Linq
Imports HtmlAgilityPack

Private Function ParseTable(doc As HtmlDocument) As DataTable
    Dim result As New DataTable()
    Dim TableClassA12 As HtmlNode = doc.DocumentNode.SelectSingleNode("//table[@class='a12']")
    Dim rows = TableClassA12.Descendants("tr")
    ' The first row only sets column widths; the second row holds the headers.
    Dim header = rows.Skip(1).First()

    For Each column In header.Descendants("td")
        result.Columns.Add(New DataColumn(column.InnerText.Trim, GetType(String)))
    Next

    ' Data rows start at the third row.
    For Each row In rows.Skip(2)
        Dim data = New List(Of String)()
        For Each column In row.Descendants("td")
            ' InnerText drops the <br> tags, so "FGH<BR>BM" becomes "FGHBM".
            Dim cellText As String = column.InnerText.Trim
            data.Add(cellText)
        Next
        If data.Count > 0 Then
            result.Rows.Add(data.ToArray())
        End If
    Next
    Return result
End Function

1 Answer:

Answer 0: (score: 0)

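As a minimal change to your existing code, you can select the div inside each td and read its InnerHtml, which gives you the inner text together with the <br> tags. At that point you only need to replace the <br> tags with commas: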

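For Each column In row.Descendants("td").SelectMany(Function(x) x.Elements("div"))
    ' String.Replace is case-sensitive and the sample mixes <BR> and <br>,
    ' so match the tag case-insensitively (this also accepts <br/> and <br />).
    Dim cellText As String = System.Text.RegularExpressions.Regex.Replace(
        column.InnerHtml.Trim(), "<br\s*/?>", ",", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
    data.Add(cellText)
Next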
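
If you would rather not do string replacement on the markup, another option is to walk the parsed nodes and join the text fragments yourself. The following is only a sketch: CellTextWithCommas and the BrSplitSketch module are hypothetical names, not part of HtmlAgilityPack, and it assumes each cell contains nothing but plain text and <br> tags, as in the sample table. Any other inline markup inside a cell would also be split at its boundaries.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module BrSplitSketch
    ' Hypothetical helper (not part of HtmlAgilityPack): collects every
    ' non-empty text node under the cell and joins them with commas.
    ' Assumes the cell holds only plain text separated by <br> tags.
    Function CellTextWithCommas(cell As HtmlNode) As String
        Dim parts As New List(Of String)()
        For Each node As HtmlNode In cell.Descendants()
            If node.NodeType = HtmlNodeType.Text Then
                ' Decode entities such as &amp; and drop surrounding whitespace.
                Dim text As String = HtmlEntity.DeEntitize(node.InnerText).Trim()
                If text.Length > 0 Then parts.Add(text)
            End If
        Next
        Return String.Join(",", parts)
    End Function
End Module
```

In the loop from the question, this would be used as data.Add(CellTextWithCommas(column)) while still iterating over row.Descendants("td"). Because it works on parsed nodes rather than raw markup, the casing of the <br> tag in the source does not matter.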