Converting PDFs (with special characters) to text

Time: 2017-10-01 03:23:34

Tags: r pdf text text-analysis

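Hello, I am trying to convert multiple PDFs to text. My code runs, but most of my files are in Spanish and contain characters such as ñ, í, ó, ú, é, and those characters come out corrupted. I also need the text files to be lowercase so that I can run text analysis on them later: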

library(XML)
library(httr)
library(dplyr)
library(tidyr)
library(stringr)
library(tm)

# Get a list of all of the document names of the downloaded PDFs
pdf_files <- list.files(path = paste(getwd(), '/pdf', sep = ''),
                        pattern = 'pdf',
                        full.names = TRUE)

# Check there are pdf files in directory
if (length(pdf_files) > 0) {

  # Loop through each PDF and create a txt version in the same folder
  for (i in pdf_files) {

    system(
      paste(
        paste('"', getwd(), '/dependencies/xpdf/bin64/pdftotext.exe"', sep = ''),
        paste0('"', i, '"')),
      wait = FALSE)

  }
}

cat('\nConversion to text complete.\n\n')
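
One possible direction, offered as a sketch rather than a confirmed fix: xpdf's pdftotext accepts an -enc option, and its default output encoding is Latin1, so requesting UTF-8 and then reading the result back as UTF-8 may keep ñ, í, ó, ú, é intact. The helper names below (pdftotext_args, lowercase_utf8_file, convert_pdfs) are hypothetical and not part of the original code, and the sketch assumes the same pdftotext.exe location as above. It also uses system2(), which waits for each conversion to finish, so the lowercasing step only runs on a completed file.

```r
# Hypothetical helper: build the argument vector for xpdf's pdftotext,
# asking for UTF-8 output and writing the .txt next to the PDF.
pdftotext_args <- function(pdf_path) {
  txt_path <- sub("\\.pdf$", ".txt", pdf_path, ignore.case = TRUE)
  c("-enc", "UTF-8", shQuote(pdf_path), shQuote(txt_path))
}

# Hypothetical helper: read a UTF-8 text file, lowercase it, and write it
# back as UTF-8 (useBytes = TRUE avoids re-encoding to the native locale).
lowercase_utf8_file <- function(txt_path) {
  lines <- readLines(txt_path, encoding = "UTF-8", warn = FALSE)
  writeLines(enc2utf8(tolower(lines)), txt_path, useBytes = TRUE)
  invisible(txt_path)
}

# Hypothetical driver: convert every PDF in pdf_dir, then lowercase the output.
# exe is assumed to point at the pdftotext executable.
convert_pdfs <- function(pdf_dir, exe) {
  pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$",
                          ignore.case = TRUE, full.names = TRUE)
  for (pdf in pdf_files) {
    # system2() waits for the command by default and returns its exit status
    status <- system2(exe, args = pdftotext_args(pdf))
    if (identical(status, 0L)) {
      lowercase_utf8_file(sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE))
    } else {
      warning("pdftotext failed for ", pdf)
    }
  }
  invisible(pdf_files)
}

# Example call (paths assumed from the question):
# convert_pdfs(file.path(getwd(), "pdf"),
#              file.path(getwd(), "dependencies/xpdf/bin64/pdftotext.exe"))
```

If the accented characters are still wrong after this, the PDFs themselves may lack a usable text layer or character map, which no output-encoding setting can repair.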

0 answers:

No answers