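I have the following R code that extracts content from a web page. I would like to know how to do the following:

Asked: 2019-02-11 11:18:50

Tags: r loops web-scraping text-extraction

  1. How can I run the same code over several .csv files, where each file contains about 50 URLs?
  2. The code below extracts the entire content of the web page, but I only want to extract the word "Baidu". Please help.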

# Request and retrieve the content from the web server
> require(RCurl)
> mine<-getURL("https://www.clickz.com/search-in-china-how-baidu-is-different-from-google/36812/",ssl.verifypeer = FALSE)
> class(mine)
[1] "character"
> is.vector(mine)
[1] TRUE
> print(mine)

# Parse the downloaded HTML into a document tree
> require(XML)
> mine.tree<-htmlTreeParse(mine,useInternalNodes = TRUE)
> print(mine.tree)

#to extract the content of each paragraph
> mine.tree.parse<-unlist(xpathApply(mine.tree,path = "//p",fun = xmlValue))
> class(mine.tree.parse)
[1] "character"
> print(mine.tree.parse)
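
One possible way to narrow the paragraph text down to just the word "Baidu" is a base R regular-expression match. The sketch below is only an illustration: the `paragraphs` vector is made-up sample text standing in for `mine.tree.parse`, and matching whole words case-insensitively is an assumption about what "extract the word" should mean.

```r
# Made-up sample paragraphs standing in for mine.tree.parse (assumption).
paragraphs <- c(
  "Baidu is a search engine.",
  "This paragraph does not mention it.",
  "Some people write baidu in lower case; Baidu appears twice here."
)

# Locate every whole-word occurrence, ignoring case.
matches <- gregexpr("\\bBaidu\\b", paragraphs, ignore.case = TRUE, perl = TRUE)
hits <- regmatches(paragraphs, matches)

# Count the occurrences per paragraph and overall.
hits_per_paragraph <- lengths(hits)
total_hits <- sum(hits_per_paragraph)

# Keep only the paragraphs that mention the word at all.
baidu_paragraphs <- paragraphs[hits_per_paragraph > 0]

print(unlist(hits))
print(total_hits)
```

Replacing `paragraphs` with `mine.tree.parse` should apply the same steps to the scraped page, assuming the paragraphs were extracted as a character vector as shown above.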

# Combine the paragraphs into one string and export it to a CSV file that Excel can open
> mine.txt<-NULL
> for(i in 2:(length(mine.tree.parse)-1)){ mine.txt<-paste(mine.txt,as.character(mine.tree.parse[i]),sep = '') }
> is.vector(mine.txt)
[1] TRUE
> length(mine.txt)
[1] 1
> print(mine.txt)
> write.table(data.frame(text = mine.txt), file="mydata.csv",sep=",",row.names=FALSE)
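
For the first point, running the same steps over several .csv files, a rough sketch might wrap the steps in functions and loop over the files. Everything named here is an assumption for illustration: the helper names `read_urls`, `fetch_page_text` and `scrape_csv_folder`, the folder name `url_lists`, and the idea that each file has a header row with a column called `url`. The `fetch` argument exists only so the download step can be swapped out, for example when trying the loop without network access.

```r
# Read the URLs from one CSV file. Assumes a header row and a column named
# "url" (an assumption; adjust to match the real files).
read_urls <- function(csv_file, url_column = "url") {
  urls <- read.csv(csv_file, stringsAsFactors = FALSE)[[url_column]]
  trimws(urls[!is.na(urls) & nzchar(urls)])
}

# Download one page and return the text of its <p> elements as one string.
# Returns NA if the request or the parse fails, so one bad URL does not
# stop the whole run.
fetch_page_text <- function(url) {
  tryCatch({
    html <- RCurl::getURL(url, ssl.verifypeer = FALSE, followlocation = TRUE)
    tree <- XML::htmlTreeParse(html, useInternalNodes = TRUE, asText = TRUE)
    paragraphs <- unlist(XML::xpathApply(tree, "//p", XML::xmlValue))
    paste(paragraphs, collapse = " ")
  }, error = function(e) NA_character_)
}

# Process every CSV file in a folder and collect one row per URL.
scrape_csv_folder <- function(folder, fetch = fetch_page_text) {
  csv_files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  results <- lapply(csv_files, function(csv_file) {
    urls <- read_urls(csv_file)
    data.frame(
      source_file = rep(basename(csv_file), length(urls)),
      url = urls,
      text = vapply(urls, fetch, character(1), USE.NAMES = FALSE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, results)
}

# Usage (hypothetical folder and output file names):
# all_pages <- scrape_csv_folder("url_lists")
# write.csv(all_pages, "mydata.csv", row.names = FALSE)
```

Collecting the results in one data frame keeps track of which file and URL each text came from, which seems convenient for a single export at the end.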

0 answers:

No answers