"404 Not Found" error when reading data from a URL, even though the file exists

Time: 2017-04-05 04:37:00

Tags: r csv url http-status-code-404

I am writing a program to collect all the daily .csv files from this page. However, for some files I get the error message:

Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
  cannot open URL 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05042016_DailyAbsenceData.csv': HTTP status was '404 Not Found'

Here is an example with the file for May 12, 2016:

read.csv(url("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05122016_DailyAbsenceData.csv"))

Strangely, if you go to the website, find the link to that file, and click it, R stops throwing the error and reads the file correctly. What is going on here, and how can I read these files without having to click each one manually? (Note that only the first of you will be able to reproduce the problem, since clicking the file fixes it.)
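One plausible explanation for the click-fixes-it behavior is that the ASP.NET site sets session cookies when you visit the listing page, and the bare CSV request fails without them. As an untested sketch (not from the original post), you could try fetching the listing page first with httr and reusing the same handle, so the cookies travel with the CSV request:

```r
library(httr)

# Sketch (assumption): visiting the listing page first may establish the
# session cookies the server expects; reusing the handle sends them along.
h <- handle("https://www.eride.ri.gov")
GET(handle = h, path = "eride2K5/AggregateAttendance/AttendanceReports.aspx")
resp <- GET(handle = h,
            path = "eride2K5/AggregateAttendance/Data/05122016_DailyAbsenceData.csv")

if (status_code(resp) == 200) {
  df <- read.csv(text = content(resp, as = "text", encoding = "UTF-8"))
}
```

Whether this works depends on how the server decides to serve the file, which is an assumption here; the accepted answer below sidesteps the question entirely by driving a real browser.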

Ultimately, I want to collect all the files with the following loop:

# Create a vector of dates. This is the interval the data is collected over.
dates = seq(as.Date("2016-05-1"), as.Date("2016-05-30"), by="days")
# Format to match the filename prefixes
dates = strftime(dates, '%m%d%Y')
# Create the vector of file names I want read.
file.names = paste(dates,"_DailyAbsenceData.csv", sep = "")

# A loop that reads the .csv files into a list of data frames
daily.truancy = list()
for (i in 1:length(dates)) {
  tryCatch({ # tryCatch prevents the loop from stopping on an error when read.csv cannot access a file
    daily.truancy[[i]] = read.csv(url(paste("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/", file.names[i], sep = "")), sep = ",")
    stop("School day") # this indicates that the file was successfully read into the list
  }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}

# Unlist the daily data to a large panel
daily.truancy.2016 <- do.call("rbind", daily.truancy)

Note that the same error message appears when there actually is no file (weekends). That is not the problem.
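Since weekend dates have no files, one small refinement (a sketch, not part of the original post) is to drop Saturdays and Sundays from the date vector before building the file names, so the loop never requests them:

```r
# Build the interval of dates, then keep weekdays only.
dates <- seq(as.Date("2016-05-01"), as.Date("2016-05-30"), by = "days")

# %u gives the ISO weekday number (Monday = 1 ... Sunday = 7),
# which avoids locale-dependent weekday names.
dates <- dates[!format(dates, "%u") %in% c("6", "7")]

# File names follow the mmddyyyy prefix pattern used on the site.
file.names <- paste0(strftime(dates, "%m%d%Y"), "_DailyAbsenceData.csv")
```

This does not fix the 404s on genuine school days, but it keeps the error log limited to real failures.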

1 Answer:

Answer 0 (score: 1)

Since the page is generated dynamically, the url function does not work, but RSelenium is designed for exactly this kind of task.

I would like to thank @jdharrison for this superb package and for his answers to challenging questions; see his answers page for more examples.

The basic setup steps are explained here: RSelenium Setup

To extract the elementID we are interested in, the easiest way is to right-click the element and choose "Inspect" in Chrome (I am not sure about other browsers, but they should have similar functionality, possibly under a different name).

This opens a side panel containing the HTML tags of the selected element.

library(RSelenium)
RSelenium:::startServer()

# You can replace the browser name with your own, e.g. "firefox"
remDr <- remoteDriver(browserName = "chrome")
remDr$open(silent = TRUE)

appURL <- 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/AttendanceReports.aspx'

# Total number of months to download
totalMonths <- 2

remDr$navigate(appURL)

for (monthYearCounter in 1:totalMonths) {

  # Active month-year on the page, e.g. "April 2017"
  monthYearElem <- remDr$findElement("xpath", "//td[contains(@style,'width:70%')]")

  # Highlights the element in yellow for visual feedback
  monthYearElem$highlightElement()

  # Extract the text
  monthYearText <- unlist(monthYearElem$getElementAttribute("innerHTML"))

  cat(paste0("Processing month year=", monthYearText, "\n"))

  # For a particular month, all the CSV files are listed in a table.
  # Extract the elementID of every CSV file using the pattern "imgBtnXls".
  csvFilesElemList <- remDr$findElements("xpath", "//input[contains(@id,'imgBtnXls')]")

  # Click each element, saving the file to the default download location.
  lapply(csvFilesElemList, function(x) {
    x$clickElement()

    # Be nice, do not overload the server with rapid requests!
    Sys.sleep(60)
  })

  # Go to the previous month
  remDr$findElement("xpath", "//a[contains(@title,'Go to the previous month')]")$clickElement()
}
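To get back to the panel the question asked for, the downloaded files can then be read in from the browser's download folder and bound together. A small sketch, where the download directory is an assumption you must adjust to your browser's settings:

```r
# Assumption: the browser saved the CSVs to this folder; adjust as needed.
download.dir <- "~/Downloads"

files <- list.files(download.dir,
                    pattern = "_DailyAbsenceData\\.csv$",
                    full.names = TRUE)

# Bind all daily files into one panel, as in the question's do.call("rbind", ...)
daily.truancy.2016 <- do.call(rbind, lapply(files, read.csv))
```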