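Information missing when scraping data

Time: 2017-07-18 09:15:33

Tags: html r dom web-crawler

I want to use R to scrape all the news related to AlphaGo on XXX (title, URL, and text). The page URL is http://www.xxxxxx.com/search/?q=AlphaGo. Here is my code: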

library(RCurl)    # getCurlHandle(), getURL(), debugGatherer()
library(XML)      # htmlParse(), xpathSApply()
library(stringr)  # str_c()

url <- "http://www.xxxxxx.com/search/?q=AlphaGo"
info <- debugGatherer()
handle <- getCurlHandle(cookiejar = "",
                        # follow HTTP redirects
                        followlocation = TRUE,
                        autoreferer = TRUE,
                        debugfunc = info$update,
                        verbose = TRUE,
                        httpheader = list(
                          from = "eddie@r-datacollection.com",
                          'user-agent' = str_c(R.version$version.string,
                                               ",", R.version$platform)
                        ))
html <- getURL(url, curl = handle, header = TRUE)
parsedpage <- htmlParse(html)

However, when I ran

xpathSApply(parsedpage,"//h3//a",xmlGetAttr,"href")

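to check whether I had located the target markup, I found that none of the news content was there. I then noticed that the Elements panel in Chrome's DevTools (opened with F12) contains the information I want, while the page source does not (the source is very messy, with all the elements piled up together). So I changed my code to: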

parsed_page <- htmlTreeParse(file = url, asTree = TRUE)

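hoping to get the DOM tree. The information was still missing, though, and I found that everything missing is what sits inside the collapsed nodes in the Elements panel (I had never run into this before).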

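Any idea how this problem arises and how to solve it?

2 Answers:

Answer 0: (score: 0)

The problem does not come from your code. The results page is generated dynamically: the links and text are inserted by JavaScript after the page loads, so the raw HTML of the results page does not contain them (you can see this if you view the page source). htmlParse and htmlTreeParse only parse the HTML the server sends; they do not run scripts.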
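To make the difference concrete, here is a minimal offline sketch. The two HTML strings are made up for illustration (they are not the real search page): one mimics the raw source, where the results container is empty, and the other mimics what the browser shows after the script has run. The helper name `links_in` is hypothetical.

```r
library(XML)

# Hypothetical raw source: the results container is empty until a script fills it.
static_html <- '<html><body>
  <div id="results"></div>
  <script>/* a script would fetch the results and fill in the div */</script>
</body></html>'

# Hypothetical rendered DOM: what the Elements panel would show afterwards.
rendered_html <- '<html><body>
  <div id="results"><h3><a href="http://example.com/story">Story</a></h3></div>
</body></html>'

# Same XPath as in the question, applied to a string of HTML.
links_in <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  unlist(xpathSApply(doc, "//h3//a", xmlGetAttr, "href"))
}

static_links <- links_in(static_html)      # nothing to find in the raw source
rendered_links <- links_in(rendered_html)  # the link is only in the rendered DOM
```

The parser never executes the script, so the first call finds no `h3` links, which matches what you are seeing.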

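There are only 10 results, so I suggest you create the list of URLs manually.

I don't know the packages you are using in this code, but I suggest rvest, which seems simpler than what you are using.

For example, for: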

url <- "http://money.cnn.com/2017/05/25/technology/alphago-china-ai/index.html"

library(rvest)
library(tidyverse)

url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="storytext"]/p') %>% 
  html_text()

 [1] " A computer system that Google engineers trained to play the game Go beat the world's best human player Thursday in China. The victory was AlphaGo's second this week over Chinese professional Ke Jie, clinching the best-of-three series at the Future of Go Summit in Wuzhen.  "                                  
 [2] " Afterward, Google engineers said AlphaGo estimated that the first 50 moves -- by both players -- were virtually perfect. And the first 100 moves were the best anyone had ever played against AlphaGo's master version. "                                                                                           
 [3] " Related: Google's man-versus-machine showdown is blocked in China "                                                                                                                                                                                                                                                 
 [4] " \"What an amazing and complex game! Ke Jie pushed AlphaGo right to the limit,\" said DeepMind CEO Demis Hassabis on Twitter. DeepMind is a British artificial intelligence company that developed AlphaGo and was purchased by Google in 2014. "                                                                    
 [5] " DeepMind made a stir in January 2016 when it first announced it had used artificial intelligence to master Go, a 2,500-year-old game. Computer scientists had struggled for years to get computers to excel at the game. "                                                                                          
 [6] " In Go, two players alternate placing white and black stones on a grid. The goal is to claim the most territory. To do so, you surround your opponent's pieces so that they're removed from the board. "                                                                                                             
 [7] " The board's 19-by-19 grid is so vast that it allows a near infinite combination of moves, making it tough for machines to comprehend. Games such as chess have come quicker to machines. "                                                                                                                          
 [8] " Related: Elon Musk's new plan to save humanity from AI "                                                                                                                                                                                                                                                            
 [9] " The Google engineers at DeepMind rely on deep learning, a trendy form of artificial intelligence that's driving remarkable gains in what computers are capable of. World-changing technologies that loom on the horizon, such as autonomous vehicles, rely on deep learning to effectively see and drive on roads. "
[10] " AlphaGo's achievement is also a reminder of the steady improvement of machines' ability to complete tasks once reserved for humans. As machines get smarter, there are concerns about how society will be disrupted, and if all humans will be able to find work. "                                                 
[11] " Historically, mankind's development of tools has always created new jobs that never existed before. But the gains in artificial intelligence are coming at a breakneck pace, which will likely accentuate upheaval in the short term. "                                                                             
[12] " Related: Google uses AI to help diagnose breast cancer "                                                                                                                                                                                                                                                            
[13] " The 19-year-old Ke and AlphaGo will play a third match Saturday morning. The summit will also feature a match Friday in which five human players will team up against AlphaGo. "      
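If you go with a hand-made list of URLs, the extraction step could be wrapped in a small helper and applied to each page. This is only a sketch: `extract_story` is a name I made up, it assumes every article keeps its text under the same `storytext` id, and the sample page is an inline string so the snippet runs without network access.

```r
library(rvest)

# Hypothetical helper: collect the paragraphs under #storytext into one string.
extract_story <- function(page) {
  paragraphs <- html_text(html_nodes(page, xpath = '//*[@id="storytext"]/p'))
  paste(trimws(paragraphs), collapse = "\n")
}

# Made-up page used in place of a live article.
sample_page <- read_html(
  '<html><body><div id="storytext"><p> First paragraph. </p><p> Second paragraph. </p></div></body></html>'
)
story <- extract_story(sample_page)

# With network access, the same helper could be mapped over your own list:
# urls <- c("http://money.cnn.com/2017/05/25/technology/alphago-china-ai/index.html")
# stories <- vapply(urls, function(u) extract_story(read_html(u)), character(1))
```

Pages laid out differently would need their own XPath.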

Best,

Colin

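Answer 1: (score: 0)

Following the idea @Colin provided, I tried to trace where the page gets its content. The dynamic content comes from a JSON file, so I used the RJSONIO package and wrote the following code: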
library(RJSONIO)

# JSON endpoint that the search page queries to fill in its results
url <- "https://search.xxxxxx.io/content?q=AlphaGo"
content <- fromJSON(url)
content1 <- content$result
# One row per result: source, publish date, headline, body, url
content_result <- matrix(NA_character_, length(content1), 5)
for (i in seq_along(content1)) {
  item <- content1[[i]]
  content_result[i, ] <- c("CNN", item$firstPublishDate,
                           if (is.null(item$headline)) NA_character_ else item$headline,
                           item$body, item$url)
}
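A slightly more defensive variant is sketched below. It guards every field rather than only the headline, so a record with any missing value still yields a full row. The JSON string is made up to mirror the field names used above (the real response may be shaped differently), and `field_or_na` is a hypothetical helper; the sketch runs offline.

```r
library(RJSONIO)

# Made-up sample standing in for the endpoint's response.
json_text <- '{"result": [
  {"firstPublishDate": "2017-05-25", "headline": "A", "body": "text a", "url": "http://example.com/a"},
  {"firstPublishDate": "2017-05-26", "headline": null, "body": "text b", "url": "http://example.com/b"}
]}'

# simplify = FALSE keeps every record as a plain list, whatever its contents.
parsed <- fromJSON(json_text, asText = TRUE, simplify = FALSE)

# Hypothetical helper: return the field as text, or NA when it is absent or null.
field_or_na <- function(item, name) {
  value <- item[[name]]
  if (is.null(value)) NA_character_ else as.character(value)
}

fields <- c("firstPublishDate", "headline", "body", "url")
rows <- lapply(parsed$result, function(item) {
  vapply(fields, function(f) field_or_na(item, f), character(1))
})
news <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Building the rows with `lapply` also avoids hard-coding the number of results.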