Rvest: looping through URLs only returns one element

Time: 2017-09-29 23:19:39

Tags: r for-loop web-scraping rvest

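So here's my situation: I've had a lot of success looping through a series of URLs, usually built by scraping href(s) and appending them to a domain. This is the strategy I'm using here: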



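library(rvest)  # read_html(), html_node(), html_text() and the %>% pipe

# classes: character vector of course-page URLs
data <- vector("list", length(classes))
for (i in seq_along(classes)) {
  # Download and parse one course page
  course <- read_html(classes[i])

  title <- course %>%
    html_node('h1') %>%
    html_text()

  description <- course %>%
    html_node('.block_content') %>%
    html_text()

  # title and description are overwritten on every pass; data keeps each result
  data[[i]] <- list(Title = title, Description = description)
}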
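Because the loop runs for a long time without printing anything, it can help to report progress and to keep going when a single page fails. The following is a minimal sketch of that idea, not a diagnosis of the problem. `scrape_course`, `scrape_all`, `pause` and `demo_pages` are hypothetical names, and the demo input is literal HTML (which `read_html()` accepts) so that it runs without network access.

```r
library(rvest)

# Hypothetical helper: scrape one page and return NA fields instead of
# stopping the whole run if the page cannot be read or parsed.
scrape_course <- function(src) {
  tryCatch({
    page <- read_html(src)
    list(
      Title = page %>% html_node("h1") %>% html_text(),
      Description = page %>% html_node(".block_content") %>% html_text()
    )
  }, error = function(e) {
    message("Failed: ", conditionMessage(e))
    list(Title = NA_character_, Description = NA_character_)
  })
}

# Hypothetical driver: one result per input, with a progress message per page.
scrape_all <- function(sources, pause = 0) {
  results <- vector("list", length(sources))
  for (i in seq_along(sources)) {
    message("Page ", i, " of ", length(sources))
    results[[i]] <- scrape_course(sources[i])
    Sys.sleep(pause)  # optional delay between requests
  }
  results
}

# Stand-in input (assumption): literal HTML strings instead of real URLs
demo_pages <- c(
  "<html><body><h1>Course A</h1><div class='block_content'>About A</div></body></html>",
  "<html><body><h1>Course B</h1><div class='block_content'>About B</div></body></html>"
)
demo_results <- scrape_all(demo_pages)
length(demo_results)
```

With real URLs, a call such as `scrape_all(classes, pause = 1)` would print one message per page, which makes it easier to see whether the loop is actually advancing.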

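classes is a bunch of strings that look like this (with the beginning and end cut off, because they are links and I don't have the rep to post them):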

   [1] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid=" 
   [2] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid=" 
   [3] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid="
   [4] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid=" 
   [5] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid="
   ...
   [2340] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid"
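For reference, a vector like `classes` can be built by collecting hrefs and resolving them against the domain, and it is worth checking it for duplicates before looping. This is only a sketch under assumptions: the index page, the `example.org` domain and the link paths are made up, and the HTML is inlined so the code runs offline.

```r
library(rvest)

# Hypothetical index page and domain (assumptions, not the real catalog)
index_html <- "<html><body>
  <a href='/courses/a'>Course A</a>
  <a href='/courses/b'>Course B</a>
  <a href='/courses/a'>Course A (repeated link)</a>
</body></html>"
base_url <- "http://example.org/"

# Collect the href attribute of every link on the page
hrefs <- read_html(index_html) %>%
  html_nodes("a") %>%
  html_attr("href")

# Resolve each href against the domain to get absolute URLs
urls <- xml2::url_absolute(hrefs, base_url)

# Sanity checks before looping: how many URLs, and how many are repeats
length(urls)
sum(duplicated(urls))

# Keep each URL once so that no page is fetched twice
urls <- unique(urls)
```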

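There is no problem when I test the links individually; the loop also runs fine if I ask for specific URLs rather than the whole index. However, if I run it over the entire length of classes, it runs for a long time and returns only one result: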

> description
[1] "\n                  \n                      \t\t\t\t\t\tHELP\n\t\t\t\t\t\t2017-2018 Pomona College Catalog Print-Friendly Page [Add to Portfolio]                      \n                    THEA199IRPO - Theatre: Independent ResearchWhen Offered: Each semester.Instructor(s): StaffCredit: 0.5-1A substantial and significant piece of original research or creative product produced. Prerequisite course work required. Available for full or half-course credit.  Back to Top | Print-Friendly Page [Add to Portfolio]                  "
> title
[1] "THEA199IRPO - Theatre: Independent Research"
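One thing worth checking, offered as an assumption because the post does not show how the result was inspected: `title` and `description` are reassigned on every pass of the loop, so after it finishes they hold only the last page's values, while the accumulated results live in `data`. The sketch below reproduces that with made-up records and no network access; `fake_titles` and `courses` are hypothetical names.

```r
# Stand-in for the scraping loop (assumption): three fake records
fake_titles <- c("Course A", "Course B", "Course C")

data <- list()
for (i in seq_along(fake_titles)) {
  title <- fake_titles[i]
  description <- paste("Description of", fake_titles[i])
  data[[length(data) + 1]] <- list(Title = title, Description = description)
}

title         # holds only the value from the final iteration
length(data)  # one entry per iteration

# Flatten the list of records into a data frame for easier inspection
courses <- data.frame(
  Title = vapply(data, function(x) x$Title, character(1)),
  Description = vapply(data, function(x) x$Description, character(1)),
  stringsAsFactors = FALSE
)
nrow(courses)
```

If `length(data)` matches `length(classes)` after the real loop, then every page was scraped and only the inspection step was looking at the last one.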

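I'm honestly stumped, considering that a) I've done this successfully before and b) the links aren't broken. I'm also not getting any error messages. Any help is very welcome!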

0 answers:

No answers