Regex pattern to get information from an HTML table

Time: 2016-05-12 22:27:31

Tags: html regex vba web-scraping

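I would like to extract data from an HTML file with a regular expression, but I don't know which pattern to use. The HTML code comes from an email.

Below is part of the HTML code. I would like to be able to get "40120 LBS".

What would the pattern look like?

I was thinking of something like: Shipment's weight[any character][0-9][0-9][0-9][0-9][0-9]

...etc.

Maybe you know something more efficient to achieve what I want. Thank you.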

<tr style='mso-yfti-irow:8' id="row_65">
  <td width=170 valign=top style='width:127.5pt;background:white;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
  weight<o:p></o:p></span></p>
  </td>
  <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
  </td>
 </tr>
 <tr style='mso-yfti-irow:9' id="row_116">
  <td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
  or LBS<o:p></o:p></span></p>
  </td>
  <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
  </td>
 </tr>

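2 Answers:

Answer 0 (score: 1)

Parsing HTML in VBA

Granted, this parsing routine doesn't do exactly what you need, but it should get you moving in the right direction in VBA.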

 'Requires references to Microsoft Internet Controls and Microsoft HTML Object Library

Sub Extract_TD_text() 

    Dim URL As String 
    Dim IE As InternetExplorer 
    Dim HTMLdoc As HTMLDocument 
    Dim TDelements As IHTMLElementCollection 
    Dim TDelement As HTMLTableCell 
    Dim r As Long 

     'Saved from www vbaexpress com/forum/forumdisplay.php?f=17
    URL = "file://C:\VBAExpress_Excel_Forum.html" 

    Set IE = New InternetExplorer 

    With IE 
        .navigate URL 
        .Visible = True 

         'Wait for page to load
        While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend 

        Set HTMLdoc = .document
    End With

    Set TDelements = HTMLdoc.getElementsByTagName("TD")

    Sheet1.Cells.ClearContents

    r = 0
    For Each TDelement In TDelements
         'Look for required TD elements - this check is specific to VBA Express forum - modify as required
        If TDelement.className = "alt2" And TDelement.Align = "center" Then
            Sheet1.Range("A1").Offset(r, 0).Value = TDelement.innerText
            r = r + 1
        End If
    Next

End Sub

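Using Regex

Using a regular expression to parse HTML is not recommended because of all the obscure edge cases that can crop up, but it seems you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.

Description

This regex will do the following:

  • parse the sample text into individual rows
  • collect the row number
  • collect the two plain-text values
  • avoid many of the obscure edge cases that make HTML difficult to parse with a regex

The Regex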

<tr\s
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)row_([0-9]+)\1(?:\s|>))
(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
[^<]*</tr>

[Image: regular expression visualization]

Note: for this regex you will need to use the following flags: ignore whitespace, case insensitive, and dot matches all characters.
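If you run this from VBA, be aware that the VBScript.RegExp engine only exposes the Global, IgnoreCase and MultiLine options, so the ignore-whitespace and dot-matches-all flags are not available there. One possible workaround, sketched below under that assumption, is to join the pattern lines without any whitespace and write [\s\S] where the original uses a dot. The names AttrBody, RowPattern and FindRowValue are made up for this illustration, and the late-bound CreateObject call is just one way to get at the engine.

```vba
Option Explicit

' One attribute-run alternative group, copied from the pattern above.
Private Function AttrBody() As String
    AttrBody = "(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)"
End Function

' The pattern above as a single line: no layout whitespace, and [\s\S]
' in place of the dot, because VBScript.RegExp has neither flag.
Private Function RowPattern() As String
    Dim cell As String
    cell = "(?:[^<]*<(?:td|p|span)\s" & AttrBody() & "*?>)+([^<]*)[\s\S]*?</td>"
    RowPattern = "<tr\s" & _
        "(?=" & AttrBody() & "*?\sid=(['""]?)row_([0-9]+)\1(?:\s|>))" & _
        AttrBody() & "*>" & cell & cell & "[^<]*</tr>"
End Function

' Hypothetical helper: returns the second-cell text of the first row whose
' first cell reads labelText, or "" when no such row is found.
Public Function FindRowValue(ByVal html As String, ByVal labelText As String) As String
    Dim re As Object, ws As Object, m As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = RowPattern()
    re.Global = True
    re.IgnoreCase = True

    ' Second regex collapses the line break inside a label such as "KG / or LBS".
    Set ws = CreateObject("VBScript.RegExp")
    ws.Pattern = "\s+"
    ws.Global = True

    For Each m In re.Execute(html)
        ' SubMatches is zero-based: index 2 is capture group 3, index 3 is group 4.
        If Trim$(ws.Replace(m.SubMatches(2), " ")) = labelText Then
            FindRowValue = Trim$(m.SubMatches(3))
            Exit Function
        End If
    Next m
End Function
```

With the email HTML in a string variable, FindRowValue(html, "Shipment's weight") & " " & FindRowValue(html, "KG or LBS") should give the "40120 LBS" text the question asks for.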

Example

Given your sample text

<tr style='mso-yfti-irow:8' id="row_65">
  <td width=170 valign=top style='width:127.5pt;background:white;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
  weight<o:p></o:p></span></p>
  </td>
  <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
  </td>
 </tr>
 <tr style='mso-yfti-irow:9' id="row_116">
  <td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
  or LBS<o:p></o:p></span></p>
  </td>
  <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
  </td>
 </tr>

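The regex will create the following capture groups

  • Capture group 0 gets the entire row
  • Capture group 1 gets the quote surrounding the row number in the row's id attribute
  • Capture group 2 gets the row number
  • Capture group 3 gets the first table cell value
  • Capture group 4 gets the second table cell value

With the following matches: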

[0][0] = <tr style='mso-yfti-irow:8' id="row_65">
  <td width=170 valign=top style='width:127.5pt;background:white;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
  weight<o:p></o:p></span></p>
  </td>
  <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
  </td>
 </tr>
[0][1] = "
[0][2] = 65
[0][3] = Shipment's
  weight
[0][4] = 40120

[1][0] = <tr style='mso-yfti-irow:9' id="row_116">
  <td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
  or LBS<o:p></o:p></span></p>
  </td>
  <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
  </td>
 </tr>
[1][1] = "
[1][2] = 116
[1][3] = KG
  or LBS
[1][4] = LBS

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <tr                      '<tr'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    id=                      'id='
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      ['"]?                    any character of: ''', '"' (optional
                               (matching the most amount possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    row_                     'row_'
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      [0-9]+                   any character of: '0' to '9' (1 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    \1                       what was matched by capture \1
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      td                       'td'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      p                        'p'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      span                     'span'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  </td>                    '</td>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      td                       'td'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      p                        'p'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      span                     'span'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  </td>                    '</td>'
----------------------------------------------------------------------
  [^<]*                    any character except: '<' (0 or more times
                           (matching the most amount possible))
----------------------------------------------------------------------
  </tr>                    '</tr>'
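
For the single value in the question, a much shorter pattern along the lines the asker proposed may be enough. The sketch below assumes VBScript.RegExp via late binding and assumes the value is the first piece of matching text that sits directly between a ">" and a "<" after the label. It skips the attribute handling described above, so it is only suitable for markup as regular as the sample. ValueAfterLabel is a made-up name.

```vba
Option Explicit

' Hypothetical helper: finds labelPattern, then lazily skips ahead to the
' first text node (">...<") that matches valuePattern and returns it.
' Returns "" when nothing matches.
Public Function ValueAfterLabel(ByVal html As String, _
                                ByVal labelPattern As String, _
                                ByVal valuePattern As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    ' [\s\S]*? stands in for "any character, including line breaks".
    re.Pattern = labelPattern & "[\s\S]*?>\s*(" & valuePattern & ")\s*<"

    Set matches = re.Execute(html)
    If matches.Count > 0 Then ValueAfterLabel = matches(0).SubMatches(0)
End Function
```

A call such as ValueAfterLabel(html, "Shipment's\s+weight", "[0-9]+") & " " & ValueAfterLabel(html, "KG\s+or\s+LBS", "[A-Za-z]+") combines the weight and the unit.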

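Answer 1 (score: 1)

Rather than using RegExp to parse the HTML file, use a DOM parser.

The most straightforward way is to add a reference to the Microsoft HTML Object Library and use that. Getting to know the objects can be a bit tricky, but not as tricky as trying to handle HTML with a regex!

The key is to determine the rules you want to use to extract the values.

Here is an example that (hopefully) demonstrates the technique.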

Public Sub SimpleParser()
  Dim doc As MSHTML.HTMLDocument
  Dim b As MSHTML.HTMLBody
  Dim tr As MSHTML.HTMLTableRow, td As MSHTML.HTMLTableCell
  Dim columnNumber As Long, rowNumber As Long
  Dim trCells As MSHTML.IHTMLElementCollection
  Set doc = New MSHTML.HTMLDocument
  doc.body.innerHTML = "<table><tr style='mso-yfti-irow:8' id=""row_65""> <td width=170 valign=top style='width:127.5pt;background:white; padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""question_65""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>Shipment's weight<o:p></o:p></span></p> </td> <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""value_65""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>40120<o:p></o:p></span></p> </td> </tr> <tr style='mso-yfti-irow:9' id=""row_116""> <td width=170 valign=top style='width:127.5pt;background:#F3F3F3; padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""question_116""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>KG or LBS<o:p></o:p></span></p> </td> <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""value_116""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>LBS<o:p></o:p></span></p> </td> </tr></table>"
  Set b = doc.body
  'Example of looping through elements
  For Each tr In b.getElementsByTagName("tr")
    rowNumber = rowNumber + 1
    columnNumber = 0
    For Each td In tr.getElementsByTagName("td")
      columnNumber = columnNumber + 1
      Debug.Print rowNumber & "," & columnNumber, td.innerText
    Next
  Next
  'Go through each row; if the first cell is "Shipment's weight", display the next cell.
  For Each tr In b.getElementsByTagName("tr")
    Set trCells = tr.getElementsByTagName("td")
    If trCells.Item(0).innerText = "Shipment's weight" Then Debug.Print "Weight: " & trCells.Item(1).innerText
  Next

End Sub
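
As an illustration of picking a different rule: the sample cells carry id attributes (value_65 for the weight, value_116 for the unit), so one option is to look the cells up by id instead of comparing label text. This is only a sketch. It assumes those ids stay the same from one email to the next, which the question does not confirm, and that the rows are wrapped in a table element as in SimpleParser. CellTextById is a made-up name.

```vba
Option Explicit

' Requires a reference to the Microsoft HTML Object Library.
' Hypothetical helper: returns the text of the element with the given id,
' or "" if the document has no such element.
Public Function CellTextById(ByVal tableHtml As String, ByVal cellId As String) As String
    Dim doc As MSHTML.HTMLDocument
    Dim el As MSHTML.IHTMLElement
    Dim cellText As String

    Set doc = New MSHTML.HTMLDocument
    doc.body.innerHTML = tableHtml      ' same loading approach as SimpleParser
    Set el = doc.getElementById(cellId)
    If el Is Nothing Then Exit Function

    ' innerText may carry line breaks from the nested <p>; strip them.
    cellText = Replace(Replace(el.innerText, vbCr, ""), vbLf, "")
    CellTextById = Trim$(cellText)
End Function
```

Calling CellTextById(html, "value_65") & " " & CellTextById(html, "value_116") then joins the weight and the unit into one string.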