Unable to scrape a news website

Time: 2016-11-21 20:01:59

Tags: r web-scraping

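I am creating a dataset from the following RSS news feed: http://indianexpress.com/section/india/feed/

I am reading the following fields from this XML:

  • title
  • title URL
  • publication date

I then use each title URL to fetch the description (the synopsis shown below the main headline) by visiting each URL and scraping the page.

However, the description vector ends up with length 197, which does not match the other vectors (length 200), so I cannot create my data frame.

Can someone help me scrape this data efficiently?

The following code is reproducible: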

library("httr")
library("RCurl")
library("jsonlite")
library("lubridate")
library("rvest")
library("XML")
library("stringr")

url = "http://indianexpress.com/section/india/feed/"

newstopics = getURL(url)

newsxml = xmlParse(newstopics)

title <- xpathApply(newsxml, "//item/title", xmlValue)
title <- unlist(title)

titleurl <- xpathSApply(newsxml, '//item/link', xmlValue)
pubdate <- xpathSApply(newsxml, '//item/pubDate', xmlValue)

t1 = Sys.time()
desc <- NULL

for (i in 1:length(titleurl)){

  page = read_html(titleurl[i])
  temp = html_text(html_nodes(page,'.synopsis'))
  desc = c(desc,temp)

}

print(difftime(Sys.time(), t1, units = 'sec'))

desc = gsub("\n",' ',desc)

newsdata = data.frame(title,titleurl,desc,pubdate)

I get the following error:

Error in data.frame(title, titleurl, desc, pubdate) : 
arguments imply differing number of rows: 200, 197

1 answer:

Answer 0 (score: 0)

You can do the following:

library(tidyverse)
library(xml2)
library(rvest)

feed <- read_xml("http://indianexpress.com/section/india/feed/")

# helper function to extract information from the item node
item2vec <- function(item){
  tibble(title = xml_text(xml_find_first(item, "./title")),
         link = xml_text(xml_find_first(item, "./link")),
         pubDate = xml_text(xml_find_first(item, "./pubDate")))
}

dat <- feed %>%
  xml_find_all("//item") %>%
  map_df(item2vec)

# The following takes a while
dat <- dat %>%
  mutate(desc = map_chr(dat$link, ~read_html(.) %>% html_node('.synopsis') %>% html_text))

This gives you a data.frame / tibble with 4 columns:

> glimpse(dat)
Observations: 200
Variables: 4
$ title   <chr> "Common man has no problem with note ban, says Santosh Gangwar", "Bombay High Court comes...
$ link    <chr> "http://indianexpress.com/article/india/india-news-india/demonetisation-note-ban-cash-cru...
$ pubDate <chr> "Mon, 21 Nov 2016 20:04:21 +0000", "Mon, 21 Nov 2016 20:01:43 +0000", "Mon, 21 Nov 2016 1...
$ desc    <chr> "MoS for Finance speaks to Indian Express in Bareilly, his Lok Sabha constituency.", "The...

P.S.: To get all the information in an item, you can use:
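The snippet for this step is not shown here, so what follows is only a sketch of one possible approach, not necessarily the code the answer intended. It assumes every child element of an item should become a column; item2row is a hypothetical helper name, and the inline feed is made-up sample data so the sketch runs without network access.

```r
library(xml2)
library(tibble)

# Hypothetical helper (an assumption, not from the answer):
# one row per <item>, one column per child element.
item2row <- function(item) {
  children <- xml_children(item)
  values <- as.list(xml_text(children))
  # make.unique() keeps repeated tags (e.g. several <category>) as separate columns
  names(values) <- make.unique(xml_name(children))
  as_tibble(values)
}

# Made-up inline feed so the sketch runs offline
demo <- read_xml('<rss><channel>
  <item><title>A</title><link>http://example.com/a</link><pubDate>Mon, 21 Nov 2016 20:04:21 +0000</pubDate></item>
</channel></rss>')

row <- item2row(xml_find_first(demo, "//item"))
```

Passing a helper like this to map_df in place of item2vec should fill columns missing from some items with NA.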
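As for why the loop in the question came up three short: the most likely cause is that a few article pages have no .synopsis element. html_nodes then returns an empty set, html_text gives character(0), and c(desc, temp) adds nothing for that page. html_node (singular) returns a missing node instead, which html_text turns into NA, so every link yields exactly one value. The sketch below reproduces the effect with plain vectors; the scraped list is made-up stand-in data, not real scraper output.

```r
# Stand-in for html_text(html_nodes(page, '.synopsis')) on three pages;
# character(0) plays the role of a page with no synopsis (assumed data).
scraped <- list("first synopsis", character(0), "third synopsis")

# Pattern from the question: c() silently drops zero-length results
desc_loop <- NULL
for (s in scraped) desc_loop <- c(desc_loop, s)
length(desc_loop)   # 2

# One value per page: substitute NA when nothing matched
desc_fixed <- vapply(
  scraped,
  function(s) if (length(s) == 0) NA_character_ else s[1],
  character(1)
)
length(desc_fixed)  # 3
```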