Using a regular expression to extract the required attribute values from an HTML string

Time: 2019-04-17 15:33:09

Tags: html regex extract word

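I have HTML strings retrieved from the Discourse API. They contain several elements (p, span, div, etc.), some of which have attributes such as data-time, data-timezone, and data-email-preview. I want the values of the data-email-preview attribute; these values are timestamps. They always sit in the first two span elements of the HTML string. Example HTML string: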

<p><span data-date="2019-05-10" data-time="19:00:00" class="discourse-local-date" data-timezones="Europe/Brussels" data-timezone="Europe/Berlin" data-email-preview="2019-05-10T17:00:00Z UTC">2019-05-10T17:00:00Z</span> → <span data-date="2019-05-10" data-time="22:00:00" class="discourse-local-date" data-timezones="Europe/Brussels" data-timezone="Europe/Berlin" data-email-preview="2019-05-10T20:00:00Z UTC">2019-05-10T20:00:00Z</span><br>
<div class="lightbox-wrapper"><div class="meta">
<span class="filename">HackSpace_by_Sugar_Ray_Banister.jpg</span><span class="informations">1596×771 993 KB</span><span class="expand"></span>
</div></a></div></p>

I need to extract these two dates from between the span tags:

2019-05-10T17:00:00Z
2019-05-10T20:00:00Z

4 Answers:

Answer 0 (score: 1)

(?<=>)(\d{4}\-\d{2}\-\d{2}T\d{2}\:\d{2}\:\d{2}Z)(?=<\/span>)

This returns the elements you need.
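As a minimal sketch of how that pattern could be applied, here it is in Python's re module (Python is an assumption here; the question does not name a language). The names TIMESTAMP_RE, SAMPLE_HTML, and extract_dates are made up for illustration, and the sample is a shortened copy of the HTML from the question.

```python
import re
from typing import List

# Pattern from the answer above: a timestamp that comes directly after '>'
# and directly before '</span>'. The backslashes before '-', ':' and '/'
# in the original are optional, so they are left out here.
TIMESTAMP_RE = re.compile(r"(?<=>)(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)(?=</span>)")

# Shortened copy of the sample HTML from the question.
SAMPLE_HTML = (
    '<p><span data-date="2019-05-10" data-time="19:00:00" '
    'data-email-preview="2019-05-10T17:00:00Z UTC">2019-05-10T17:00:00Z</span> → '
    '<span data-date="2019-05-10" data-time="22:00:00" '
    'data-email-preview="2019-05-10T20:00:00Z UTC">2019-05-10T20:00:00Z</span><br>'
    '<span class="filename">HackSpace_by_Sugar_Ray_Banister.jpg</span></p>'
)


def extract_dates(html: str) -> List[str]:
    """Return every timestamp that is the full text of an element closed by </span>."""
    return TIMESTAMP_RE.findall(html)


print(extract_dates(SAMPLE_HTML))
```

The attribute values are not matched because they are preceded by a quote rather than by '>', so only the span text is returned.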

Answer 1 (score: 0)

Maybe this does what you need?

https://regex101.com/r/Jo4srA/1

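(The expression has been modified to fit your needs.)

Answer 2 (score: 0)

You can do this with an HTML DOM library that is available on GitHub; I downloaded it from SourceForge at this link: https://simplehtmldom.sourceforge.io

Use it as follows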

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images 
foreach($html->find('img') as $element) 
echo $element->src . '<br>';

// Find all links 
foreach($html->find('a') as $element) 
echo $element->href . '<br>';

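For your case, select the span elements like this

// find('span[data-email-preview]') selects spans that have the attribute;
// 'span.data-email-preview' would select by class name and match nothing here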

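Using preg_replace is also simple, but it can get confusing because the HTML contains many values, so the output will contain many dates. You would then have to build an array from that output and loop over it to get each date on its own line before you can import the dates into a database.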
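For comparison, the same parser-based idea can be sketched in Python with the standard library's html.parser. This swaps in a different library for simple_html_dom and is only an illustration: PreviewCollector, extract_previews, and SAMPLE_HTML are hypothetical names, and the sample is a shortened copy of the HTML from the question.

```python
from html.parser import HTMLParser
from typing import List


class PreviewCollector(HTMLParser):
    """Collect data-email-preview values from span start tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.previews: List[str] = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag != "span":
            return
        for name, value in attrs:
            if name == "data-email-preview" and value:
                self.previews.append(value)


def extract_previews(html: str) -> List[str]:
    """Return the data-email-preview values in document order."""
    parser = PreviewCollector()
    parser.feed(html)
    parser.close()
    return parser.previews


# Shortened copy of the sample HTML from the question.
SAMPLE_HTML = (
    '<p><span data-date="2019-05-10" class="discourse-local-date" '
    'data-email-preview="2019-05-10T17:00:00Z UTC">2019-05-10T17:00:00Z</span> → '
    '<span data-date="2019-05-10" class="discourse-local-date" '
    'data-email-preview="2019-05-10T20:00:00Z UTC">2019-05-10T20:00:00Z</span><br>'
    '<span class="filename">HackSpace_by_Sugar_Ray_Banister.jpg</span></p>'
)

# Each value in the sample carries a trailing " UTC"; keep only the timestamp part.
for value in extract_previews(SAMPLE_HTML):
    print(value.split()[0])
```

Because the values end up in a list, looping over them or inserting them into a database one row at a time needs no extra splitting step.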

Answer 3 (score: 0)

In VBA, something like this

' Requires references to "Microsoft HTML Object Library" and
' "Microsoft VBScript Regular Expressions 5.5"
Sub Extract2()

    Dim hDoc As MSHTML.HTMLDocument
    Dim dateBody As Object
    Dim sFile As String, lFile As Long
    Dim pat1 As String
    Dim sHtml As String
    Dim Date1 As String, Date2 As String

    'read in the file
    sFile = "c:\1.html"
    lFile = FreeFile
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    Close lFile

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    'DOM approach: text of the first two discourse-local-date spans
    Set dateBody = hDoc.getElementsByClassName("discourse-local-date")
    Date1 = dateBody(0).innerText
    Date2 = dateBody(1).innerText
    MsgBox Date1 & " " & Date2

    'regex approach: capture the text between <span ...> and </span>
    pat1 = "<span[^>]*>([^<]+)</span>"
    Date1 = simpleRegex(sHtml, pat1, 0)
    Date2 = simpleRegex(sHtml, pat1, 1)
    MsgBox Date1 & " " & Date2

End Sub

The regex function

Function simpleRegex(strInput As String, strPattern As String, sNr As Long) As String
    Dim regEx As New RegExp
    Dim sReg As Object

    simpleRegex = "false"   'returned when there is no pattern or no such match
    If strPattern <> "" Then
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
            .Pattern = strPattern
        End With
        'run the pattern once and return the first submatch of match number sNr
        Set sReg = regEx.Execute(strInput)
        If sNr < sReg.Count Then
            simpleRegex = sReg(sNr).SubMatches(0)
        End If
    End If
End Function
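A comparable "give me match number n" helper could look roughly like this in Python. It is a sketch under assumptions: the pattern captures the text of each span, and SPAN_TEXT_RE and nth_span_text are hypothetical names. Unlike the VBA function, it returns None instead of the string "false" when there is no such match.

```python
import re
from typing import Optional

# Assumed pattern: capture the text between an opening <span ...> tag and </span>.
SPAN_TEXT_RE = re.compile(r"<span[^>]*>([^<]+)</span>", re.IGNORECASE)


def nth_span_text(html: str, index: int) -> Optional[str]:
    """Return the captured text of the index-th span match, or None if out of range."""
    matches = SPAN_TEXT_RE.findall(html)
    if 0 <= index < len(matches):
        return matches[index]
    return None


# Shortened copy of the sample HTML from the question.
SAMPLE_HTML = (
    '<p><span class="discourse-local-date">2019-05-10T17:00:00Z</span> → '
    '<span class="discourse-local-date">2019-05-10T20:00:00Z</span><br>'
    '<span class="filename">HackSpace_by_Sugar_Ray_Banister.jpg</span></p>'
)

print(nth_span_text(SAMPLE_HTML, 0), nth_span_text(SAMPLE_HTML, 1))
```

Checking the index before using it avoids an error when the HTML contains fewer spans than expected.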