使用ExcelVBA进行网页抓取

时间:2018-10-30 15:36:51

标签: excel vba web-scraping

如何从Here中提取表数据?

我可以看到每一行都包含在“ team-name first”类中​​。我想将表格转入excel,但是使用from web选项时,在IE窗口中看不到表格。我认为VBA是我需要采用的途径。我尝试了一些谷歌搜索和YouTube教程,但没有取得任何成功。任何帮助将不胜感激!

snip

**编辑 对不起,我以为我附上了我的代码。问题是它没有加载整个页面。所以我认为这就是为什么我无法提取数据。

There should be a table showing here

Sub FetchNBADefense()

Dim IE As Object, obj As Object
Dim r As Long, c As Long, t As Long
Dim elemCollection As Object
Dim eRow As Long


Set IE = CreateObject("InternetExplorer.Application")

With IE

.Visible = True
.navigate ("https://stats.nba.com/teams/opponent/?sort=W&dir=-1")



While IE.readyState <> 4
    DoEvents
Wend

ThisWorkbook.Sheets("TeamDefenses").Range("A1:M60").ClearContents
Set elemColleciton = IE.document.getElementsByTagName("team-name first")
For t = 0 To (elemCollection.Length - 1)
    For r = 0 To (elemCollection(t).Rows.Cells.Length - 1)
        For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
        eRow = Sheet1.Cells(Rows.Count, 1).End(xlUp).Offset(1, 0).Row

        ThisWorkbook.Worksheets(1).Cells(eRow, c + 1) = elemCollection(t).Rows(r).Cells(c).innerText
        Next c
    Next r
Next t

End With
Range("A1:M60").Columns.AutoFit
'Clear memory
Set IE = Nothing

End Sub

***新代码:我缺少什么?我看到它是“ resultSet”而不是“ resultSets”,但仍然出现运行时间错误“ 424”:所需对象

Option Explicit

Public Sub FetchNBAplayerpts()

Range("A1").Select
Range(Selection, Selection.End(xlToRight)).Select
Range(Selection, Selection.End(xlDown)).Select
Selection.ClearContents

Dim json As Object
With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2018-19&SeasonType=Regular+Season&StatCategory=PTS", False
    .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
    .send
    Set json = JsonConverter.ParseJson(.responseText)("resultSet")(1)
End With
Dim headers As Object, header As Variant, headerOutput(), i As Long, rowInfo As Object, iRow As Object
Set headers = json("headers")
Set rowInfo = json("rowSet")
ReDim headerOutput(1 To headers.Count)
For Each header In headers
    i = i + 1
    headerOutput(i) = header
Next

Dim rowData(), r As Long, c As Long, Item As Variant
ReDim rowData(1 To rowInfo.Count, 1 To UBound(headerOutput))

For Each iRow In rowInfo
    r = r + 1: c = 1
    For Each Item In iRow
        rowData(r, c) = Item
    c = c + 1
    Next
Next

With ThisWorkbook.Worksheets("PlayerPts")
    .Cells(1, 1).Resize(1, UBound(headerOutput)) = headerOutput
    .Cells(2, 1).Resize(UBound(rowData, 1), UBound(rowData, 2)) = rowData
End With

End Sub

1 个答案:

答案 0 :(得分:3)

通过与@TylerH和@LuckyKleinschmidt的讨论,该页面似乎使用了javascript方法includes,但IE不支持该方法。这可能就是为什么页面未完全呈现(因为脚本未运行)的原因。参见here。解决方法是在相关脚本中使用indexOf方法。我想开发人员不必担心IE的市场份额很小。

Browser support:

enter image description here

如果您恰巧在Chrome / Firefox开发工具中进行检查,或者使用诸如fiddler之类的网络流量监控工具进行检查,则会发现实际上有一个XMLHTTP request发送来将数据检索到其他来源,并且实际上,您可以使用该URL发出XMLTTP请求。与打开浏览器相比,这是一种更快的检索方法,因此在这种情况下是一个双赢的选择。响应是一个JSON响应,可以使用JSON解析器进行处理。我使用JSONConverter.bas来下载并添加到您的项目中。

将上述链接中的.bas添加到您的项目后,您可以通过VBE>工具>引用> Microsoft脚本运行时添加引用。

JSON响应具有以下结构(显示了示例):

enter image description here

{意味着一个字典,因此您可以通过键访问,[意味着一个集合,因此您可以通过索引访问(或者像我一样,For Each来访问)。 ""表示字符串文字,因此您按原样阅读。测试数据类型和所需的句柄。

通过此方法检索的信息比页面上可见的更多。

输出样本:

enter image description here


VBA:

Option Explicit    
Public Sub GetTable()       
    Dim json As Object
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        Set json = JsonConverter.ParseJson(.responseText)("resultSets")(1)
    End With
    Dim headers As Object, header As Variant, headerOutput(), i As Long, rowInfo As Object, iRow As Object
    Set headers = json("headers")
    Set rowInfo = json("rowSet")
    ReDim headerOutput(1 To headers.Count)
    For Each header In headers
        i = i + 1
        headerOutput(i) = header
    Next

    Dim rowData(), r As Long, c As Long, item As Variant
    ReDim rowData(1 To rowInfo.Count, 1 To UBound(headerOutput))

    For Each iRow In rowInfo
        r = r + 1: c = 1
        For Each item In iRow
            rowData(r, c) = item
            c = c + 1
        Next
    Next

    With ThisWorkbook.Worksheets("Sheet1")
        .Cells(1, 1).Resize(1, UBound(headerOutput)) = headerOutput
        .Cells(2, 1).Resize(UBound(rowData, 1), UBound(rowData, 2)) = rowData
    End With

End Sub

开发工具中的XHR请求(“网络”标签):

enter image description here