无法摆脱不必要的链接

时间:2018-07-22 16:20:19

标签: vba excel-vba web-scraping

我已经在VBA中编写了一个脚本,以分析来自一些站点的一些链接(连接到contact关键字)。每个站点的一个链接。我使用xmlhttp请求来完成任务。当我执行脚本时,它会解析每个站点的链接。唯一的问题是很少有站点没有任何此类链接(连接到contact关键字),结果我的excel工作表中的输出变得混乱。更清楚一点:如果任何站点都没有这样的链接,我的刮板将使用先前的值填充该列。我将这些收集的链接存储在每个搜索的下一列中。我希望下面的图片能使您理解我的意思。

这是我到目前为止的尝试:

Sub GetConditionalLinks()
    Dim HTTP As New XMLHTTP60, Html As New HTMLDocument
    Dim post As Object, cel As Range, newlink$, R&

    For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row)
        If cel.Value <> "" Then
            With HTTP
                .Open "GET", cel.Value, False
                .send
                Html.body.innerHTML = .responseText
            End With

            For Each post In Html.getElementsByTagName("a")
                If InStr(1, post.innerText, "contact", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
            Next post
            cel(1, 2) = newlink
        End If
    Next cel
End Sub

我尝试过的链接(我特意将几行留空以查看脚本的行为):

https://www.yify-torrent.org/search/1080p/
https://www.houzz.com/professionals/

https://chandoo.org/forum/forums/vba-macros/
https://www.amazon.com/dp/B01LTIORC8

https://stackoverflow.com/questions
https://www.amazon.com/dp/B01LTIORC8
https://www.amazon.com/dp/B00GPAFHIO

我得到的输出:

enter image description here 我期望的输出是:

enter image description here

搜索链接位于A列中,收集的链接位于B列中。您必须已经注意到由于我的循环逻辑,收集的链接与源链接不匹配。

我的问题:

  1. 如何修复循环以获得预期的输出?

  2. 如果我模仿cel(1, 2)之类的东西,那Worksheets("SomeSheet").Range("A1")的完全限定行是什么?

2 个答案:

答案 0 :(得分:2)

我希望您能够使用类似以下的内容:

Option Explicit
Public Sub GetConditionalLinks()
    Dim HTTP As New XMLHTTP60, Html As New HTMLDocument, post As Object, i As Long, arr()
    With ActiveSheet
        arr = .Range("A1:B" & .Cells(.Rows.Count, 1).End(xlUp).Row).Value
        With HTTP
            For i = LBound(arr, 1) To UBound(arr, 1)
                If arr(i, 1) <> vbNullString Then
                    .Open "GET", arr(i, 1), False
                    .send
                    Html.body.innerHTML = .responseText

                    For Each post In Html.getElementsByTagName("a")
                        If InStr(1, post.innerText, "contact", 1) > 0 Then arr(i, 2) = post.getAttribute("href"): Exit For
                    Next post
                End If
            Next i
        End With
        .Range("A1").Resize(UBound(arr, 1), UBound(arr, 2)) = arr
    End With
End Sub

访问被拒绝的站点:

因此,我开始获得拒绝访问的权限,因此请按以下步骤重新编写。对改善错误处理的建议持开放态度。这非常简陋,但我试图避免使用GoTo语句。

Option Explicit
Public Sub GetConditionalLinks()
    Dim HTTP As New MSXML2.ServerXMLHTTP60, Html As New HTMLDocument, post As Object, i As Long, arr(), timeoutError As Boolean
    With ActiveSheet
        arr = .Range("A1:B" & .Cells(.Rows.Count, 1).End(xlUp).Row).Value
        With HTTP
            For i = LBound(arr, 1) To UBound(arr, 1)
                timeoutError = False
                If arr(i, 1) <> vbNullString Then
                    .Open "GET", arr(i, 1), False
                    On Error GoTo Errhand
                    .send
                    If Not timeoutError Then
                        Html.body.innerHTML = .responseText
                        For Each post In Html.getElementsByTagName("a")
                            If InStr(1, post.innerText, "contact", 1) > 0 Then arr(i, 2) = post.getAttribute("href"): Exit For
                        Next post
                    End If
                End If
            Next i
        End With
        .Range("A1").Resize(UBound(arr, 1), UBound(arr, 2)) = arr
    End With
    Exit Sub
Errhand:
    If Err.Number <> 0 Then
        Select Case Err.Number
        Case -2147012894 '<== Timeout; especially on access denied sites
            timeoutError = True
            Resume Next
        Case Else '<== Don't know what we are gonna do yet so let's exit
            Debug.Print Err.Number, Err.Description
        End Select
    End If
End Sub

不使用数组和循环表:

Option Explicit
Public Sub GetConditionalLinks()
    Dim HTTP As New MSXML2.ServerXMLHTTP60, Html As New HTMLDocument, cel As Range, post As Object, R As Long, timeoutError As Boolean
    Application.ScreenUpdating = False
    With ActiveSheet
        For Each cel In .Range("A1:A" & .Cells(Rows.Count, 1).End(xlUp).Row)
                R = R + 1: timeoutError = False
                If Not IsEmpty(cel) Then
                    HTTP.Open "GET", cel.Value, False
                    On Error GoTo Errhand
                    HTTP.send
                    If Not timeoutError Then
                        Html.body.innerHTML = HTTP.responseText
                        For Each post In Html.getElementsByTagName("a")
                            If InStr(1, post.innerText, "contact", 1) > 0 Then
                                .Cells(R, 2) = post.getAttribute("href"): Exit For
                            End If
                        Next post
                    End If
                End If
        Next cel
    End With
    Application.ScreenUpdating = True
    Exit Sub
Errhand:
    If Err.Number <> 0 Then
        Select Case Err.Number
        Case -2147012894                         '<== Timeout; especially on access denied sites
            timeoutError = True
            Resume Next
        Case Else
            Debug.Print Err.Number, Err.Description
        End Select
    End If
    Application.ScreenUpdating = True
End Sub

答案 1 :(得分:0)

如何执行以下操作?仅在脚本中的newlink = ""之后定义for loop才能解决此问题:

Sub GetConditionalLinks()
    Dim HTTP As New XMLHTTP60, HTML As New HTMLDocument
    Dim post As Object, elem As Object, newlink$
    Dim cel As Range, R&

    For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row)
        newlink = ""             '''THIS IS THE FIX
        If cel.Value <> "" Then
            With HTTP
                .Open "GET", cel.Value, False
                .send
                HTML.body.innerHTML = .responseText
            End With

            For Each post In HTML.getElementsByTagName("a")
                If InStr(1, post.innerText, "contact", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
            Next post
            cel(1, 2) = newlink
        End If
    Next cel
End Sub