使用R同时从多个PDF提取多个短语

时间:2019-10-02 13:05:34

标签: r for-loop lapply pdftools

我在一张表中列出了pdf途径,并且我尝试对列出的pdf的其余部分重复以下命令。基本上,我仅将pdf文件转换为该文件首页的文本,然后使用keyword_search命令对该页面中的某些短语进行搜索。我一次可以成功完成一个文件,但是我有281个文件。我想念什么?

一个PDF文件

    my.file<-"//.../cover-letter.pdf"
    my.page<-pdf_text(my.file)[1] %>% as.character()
    my.result<-keyword_search(my.page, keyword = c('reason','not being marketed', 'available for sale', 'withdrawn from sale', 'commercial distribution', 'target date'), ignore_case = TRUE)
    my.result$Cover_Letter<-my.file

    my.result<-select(my.result, -5)
    result<-merge(TotNoMark_clean, my.result, by = "Cover_Letter", all.x = TRUE)

多个PDF文件:尝试失败


DF<-as.data.frame(TotNoMark_clean)
file.names<-DF$Cover_Letter

for(i in 1:length(file.names)){
  {pdf_pages<-pdf_text(file.names[i])[1]
  pdf_result<-keyword_search(pdf_pages, keyword = c('reason','not being marketed', 'available for sale', 'withdrawn from sale', 'commercial distribution', 'target date'))
  pdf_result$Cover_Letter<-file.names[i]
  if (!nrow(pdf_result)) {next}
  }
  Result<<-pdf_result
}
Result<-select(Result, -5)
Result<-merge(DF, Result, by = "Cover_Letter", all.x = TRUE)

这是我收到的错误消息:

    "Error in `$<-.data.frame`(`*tmp*`, "Cover_Letter", value = "//cover-letters/***.pdf") : 
  replacement has 1 row, data has 0"

2 个答案:

答案 0 :(得分:0)

当前,即使您使用作用域运算符<<-,您的 Result 也永远不会仅保留最后一个迭代,因为您不使用列表或循环增长对象(后者是不明智的)。实际上,您确实需要<<-,因为for循环不是在本地对象上运行,而是在全局对象上运行。如果最后一项的行为空,则next将导致 Result 为空。

考虑构建数据帧列表,然后在循环外运行bind_rows以获得最终输出:

DF <- as.data.frame(TotNoMark_clean)
# INITIALIZE EMPTY LIST
Result_dfs <- vector(mode="list", length=nrow(DF))

for(i in seq_along(DF$Cover_Letter)) {
  pdf_pages <- pdf_text(DF$Cover_Letter[i])[1]
  pdf_result <- keyword_search(pdf_pages, 
                               keyword = c('reason','not being marketed', 'available for sale', 
                                           'withdrawn from sale', 'commercial distribution', 
                                           'target date'))
  pdf_result$Cover_Letter <- DF$Cover_Letter[i]

  # SAVE TO LIST REGARDLESS OF NROWs 
  Result_dfs[i] <- pdf_result
}

# BIND ALL DFs TOGETHER AND SELECT LAST FIVE COLS
Result <- dplyr::select(dplyr::bind_rows(Result_dfs), -5)

# MERGE TO ORIGINAL
Result <- merge(DF, Result, by = "Cover_Letter", all.x = TRUE)

或者,使用lapply避免簿记初始化列表和分配列表项:

DF <- as.data.frame(TotNoMark_clean)

Result_dfs <- lapply(DF$Cover_Letter, function(f) {
    pdf_pages <- pdf_text(f)[1]
    pdf_result <- keyword_search(pdf_pages, 
                                 keyword = c('reason','not being marketed', 'available for sale', 
                                             'withdrawn from sale', 'commercial distribution', 
                                             'target date'))
    pdf_result$Cover_Letter <- f
    return(pdf_result)
})

# BIND ALL DFs TOGETHER AND SELECT LAST FIVE COLS
Result <- dplyr::select(dplyr::bind_rows(Result_dfs), -5)

# LEFT JOIN TO ORIGINAL
Result <- dplyr::left_join(DF, Result, by="Cover_Letter")

答案 1 :(得分:0)

在我检查并确保正确的字段位于正确的类中之后,这就是我最终要执行的操作,并且有效:

PhrasePull<-function(){
DF<-as.data.frame(TotNoMark_clean)
file.names<-DF$Cover_Letter
Result<-data.frame()
for(i in 1:length(file.names)){
    {pdf_pages<-pdf_text(file.names[i])[1]
    pdf_result<-keyword_search(pdf_pages, keyword = c('reason','not being marketed', 'has not marketed', 'will be able to market', 'will market', 'is not marketing', 'available for sale', 'withdrawn from sale', 'commercial marketing', 'commercial distribution', 'target date', 'will be available', 'marketing of this product has been started', 'commercially marketed', 'discontinued', 'launch.', 'not currently marketed', 'unable to market', 'listed in the active section of the Orange Book', 'not currently being manufactured or marketed'), ignore_case = TRUE)
    if (!nrow(pdf_result)) {next}
    pdf_result$Cover_Letter<-file.names[i]
    }
  Result <- bind_rows(Result, pdf_result)
  }
output<<-merge(DF, Result, by = "Cover_Letter", all.x = TRUE)
}